Analyzing moderately large data sets


1 Analyzing moderately large data sets. Thomas Lumley, R Core Development Team. useR, Rennes.
2 Problem. Analyzing a data set that contains N observations on P variables, where N × P is inconveniently large (from 100Mb on small computers up to many Gb). I will mostly assume that any task can be broken down into pieces that use only p variables, where p is small. Since N is assumed large, only tasks with complexity (nearly) linear in N are feasible. A mixture of concrete implementations for linear models and general ideas.
3 Outline. Technologies: connections, netCDF, SQL, memory profiling. Strategies: load only the necessary variables; incremental bounded-memory algorithms; randomized subset algorithms; outsource the computing to SQL.
4 Examples. Linear regression and subset selection on large data sets (large N, small P). Analysis of large surveys (large N, P ≈ 500, p ≈ 10). Whole-genome association studies (N ≈ 5000, P ≈ 10^6, p ≈ 1). Whole-genome data cleaning (P ≈ 5000, N ≈ 10^6, p ≈ 1).
5 File connections. file() opens a connection to a file. read.table() and friends can read incrementally from the connection: read 1000 records, process them, lather, rinse, repeat. Examples: summaryRprof(), read.fwf(). Chunk size should be small enough not to cause disk thrashing, large enough that the fixed costs of read.table() don't matter. Empirically, performance is constant over a fairly wide range.
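The read-process-repeat loop can be sketched like this, with a throwaway temporary file standing in for the large data set (the column names, chunk size, and summary computed are all illustrative):

```r
## Chunked reading from a connection: only 1000 rows are in memory at a time.
tf <- tempfile()
write.csv(data.frame(x = 1:2500, y = rnorm(2500)), tf, row.names = FALSE)

con <- file(tf, open = "r")
invisible(readLines(con, n = 1))         # skip the header line
total <- 0
repeat {
  ## correct colClasses makes the type conversion faster
  chunk <- read.table(con, sep = ",", nrows = 1000,
                      col.names  = c("x", "y"),
                      colClasses = c("integer", "numeric"))
  total <- total + sum(chunk$x)          # "process" this chunk
  if (nrow(chunk) < 1000) break          # short chunk: end of file
}
close(con)
unlink(tf)
```

Because read.table() is handed an already-open connection, each call picks up where the previous one stopped.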
6 File connections. Since read.table() is the bottleneck, make sure to use it efficiently: specify correct types in the colClasses= argument to get faster conversions; specify "NULL" for columns that you don't need to read.
7 netCDF. netCDF is an array-structured storage format developed for atmospheric science research (HDF5 from NCSA is similar, but R support is not as good). It allows efficient reading of rows, columns, layers, and general hyperrectangles of data. We have used this for whole-genome genetic data. Martin Morgan (FHCRC) reports that this approach to whole-genome data scales to a cluster with hundreds of nodes.
8 Large data. netCDF was designed by the NSF-funded UCAR consortium, who also manage the National Center for Atmospheric Research. Atmospheric data are often array-oriented: eg temperature, humidity, wind speed on a regular grid of (x, y, z, t). We need to be able to select rectangles of data, eg a range of (x, y, z) on a particular day t. Because the data are on a regular grid, the software can work out where to look on disk without reading the whole file: efficient data access.
9 WGA. Array-oriented data (position on genome, sample number) for genotypes and probe intensities. Potentially very large data sets: 2,000 people × 300,000 SNPs = tens of Gb; 16,000 people × 1,000,000 SNPs = hundreds of Gb. Even worse after imputation to 2,500,000 SNPs. R can't handle a matrix with more than about two billion (2^31) entries even if your computer has memory for it. Even data for one chromosome may be too big.
10 Using netCDF data. With the ncdf package: open.ncdf() opens a netCDF file and returns a connection to the file (rather than loading the data); get.var.ncdf() retrieves all or part of a variable; close.ncdf() closes the connection to the file.
11 Dimensions. Variables can use one or more array dimensions of a file. [Figure: the SNP and sample dimensions of the file, with variables such as genotype and chromosome attached to them.]
12 Example. Finding long homozygous runs (possible deletions).

nc <- open.ncdf("hapmap.nc")
## read all of chromosome variable
chromosome <- get.var.ncdf(nc, "chr", start=1, count=-1)
## set up list for results
runs <- vector("list", nsamples)
for(i in 1:nsamples){
    ## read all genotypes for one person
    genotypes <- get.var.ncdf(nc, "geno", start=c(1,i), count=c(-1,1))
    ## zero for heterozygous, chromosome number for homozygous
    hmzygous <- genotypes != 1
    hmzygous <- as.vector(hmzygous*chromosome)
13 Example (continued).

    ## consecutive runs of same value
    r <- rle(hmzygous)
    end <- cumsum(r$lengths)
    begin <- cumsum(c(1, r$lengths))
    long <- which(r$lengths > 250 & r$values != 0)
    runs[[i]] <- cbind(begin[long], end[long], r$lengths[long])
}
close.ncdf(nc)

Notes: chr uses only the SNP dimension, so start and count are single numbers; geno uses both the SNP and sample dimensions, so start and count have two entries (a count of -1 reads the whole dimension). rle() compresses runs of the same value to a single entry.
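A tiny self-contained illustration of the rle() step, with made-up values in place of the genotype vector:

```r
## rle() collapses consecutive repeats into (length, value) pairs;
## run boundaries come back from cumulative sums of the lengths.
x <- c(0, 0, 7, 7, 7, 0, 7)
r <- rle(x)
end   <- cumsum(r$lengths)        # last index of each run
begin <- end - r$lengths + 1      # first index of each run
## here "long" runs are length >= 3; the slides use > 250
long  <- which(r$lengths >= 3 & r$values != 0)
cbind(begin[long], end[long], r$lengths[long])
```

The only long non-zero run is the three 7s at positions 3 to 5.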
14 Creating netCDF files. Creating files is more complicated: define dimensions; define variables and specify which dimensions they use; create an empty file; write data to the file.
15 Dimensions. Specify the name of the dimension, the units, and the allowed values in the dim.def.ncdf() function. One dimension can be unlimited, allowing expansion of the file in the future. An unlimited dimension is important, otherwise the maximum variable size is 2Gb.

snpdim <- dim.def.ncdf("position", "bases", positions)
sampledim <- dim.def.ncdf("seqnum", "count", 1:10, unlim=TRUE)
16 Variables. Variables are defined by name, units, and dimensions.

varchrm <- var.def.ncdf("chr", "count", dim=snpdim, missval=-1, prec="byte")
varsnp <- var.def.ncdf("snp", "rs", dim=snpdim, missval=-1, prec="integer")
vargeno <- var.def.ncdf("geno", "base", dim=list(snpdim, sampledim), missval=-1, prec="byte")
vartheta <- var.def.ncdf("theta", "deg", dim=list(snpdim, sampledim), missval=-1, prec="double")
varr <- var.def.ncdf("r", "copies", dim=list(snpdim, sampledim), missval=-1, prec="double")
17 Creating the file. The file is created by specifying the file name and a list of variables.

genofile <- create.ncdf("hapmap.nc", list(varchrm, varsnp, vargeno, vartheta, varr))

The file is empty when it is created. Data can be written using put.var.ncdf(). Because the whole data set is too large to read, we might read raw data and save to netCDF for one person at a time.

for(i in 1:4000){
    geno <- readRawData(i) ## somehow
    put.var.ncdf(genofile, "geno", geno, start=c(1,i), count=c(-1,1))
}
18 SQL. Relational databases are the natural habitat of many large data sets. They are good at: data management; merging data sets; selecting subsets; simple data summaries. Relational databases all speak Structured Query Language (SQL), which is more or less standardized. Most of the complications of SQL deal with transaction processing, which we don't have. Initially we care about the SELECT statement.
19 SQL.

SELECT var1, var2 FROM table WHERE condition
SELECT sum(var1) AS svar1, sum(var2) AS svar2 FROM table GROUP BY category
SELECT * FROM table LIMIT 1
SELECT count(*) FROM table
SELECT count(*) FROM table GROUP BY category
20 SQL. The DBI package defines an interface to databases, implemented by eg RSQLite. RODBC defines a similar but simpler interface to ODBC (predominantly a Windows standard interface). SQLite (http://www.sqlite.org), via RSQLite or sqldf, is the easiest database to use. It isn't suitable for very large databases or for large numbers of simultaneous users, but it is designed to be used as embedded storage for other programs. Gabor Grothendieck's sqldf package makes it easier to load data into SQLite and manipulate it in R.
21 DBI/RSQLite.

library(RSQLite)
sqlite <- dbDriver("SQLite")
conn <- dbConnect(sqlite, "~/CHIS/brfss07.db")

dbGetQuery(conn, "select count(*) from brfss")
results <- dbSendQuery(conn, "select HSAGEIR from brfss")
fetch(results, 10)
fetch(results, 10)
dbClearResult(results)

Change the dbDriver() call to connect to a different database program (eg Postgres).
22 Memory profiling. When memory profiling is compiled into R: tracemem(object) reports all copying of object; Rprof(, memory.profiling=TRUE) writes information on duplications, large memory allocations, and new heap pages at each tick of the time-sampling profiler; Rprofmem() is a pure allocation profiler: it writes a stack trace for any large memory allocation or any new heap page. Bioconductor programmers have used tracemem() to cut down copying for single large data objects. Rprofmem() showed that foreign::read.dta() had a bottleneck in as.data.frame.list(), which copies each column 10 times. (!)
23 Load-on-demand. Use case: large surveys. The total data set size is large (eg N ≈ 10^5, P ≈ 10^3), but any given analysis uses only p ≈ 10 variables. Loading the whole data set is feasible on 64-bit systems, but not on my laptop. We want an R object that represents the data set and can load the necessary variables when they are needed. Difficult in general: data frames work because they are special-cased by the internal code, especially model.frame(). The formula + data.frame interface is tractable, as is formula + expression.
24 Basic idea.

doSomething <- function(formula, database){
    varlist <- paste( all.vars(formula), collapse=", ")
    query <- paste("select", varlist, "from", database$tablename)
    dataframe <- dbGetQuery(database$connection, query)
    ## now actually do Something
    fitModel(formula, dataframe)
}

First construct a query to load all the variables you need, then submit it to the database to get the data frame, then proceed as usual. Refinements: some variables may be in memory, not in the database; we may need to define new variables; we may want to wrap an existing set of code.
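The query-construction step can be run on its own, with no database attached (the formula and table name here are illustrative):

```r
## all.vars() pulls the variable names out of a formula, in order;
## pasting them into a SELECT statement gives the load-on-demand query.
f <- api00 ~ api99 + emer + stype
varlist <- paste(all.vars(f), collapse = ", ")
query <- paste("select", varlist, "from", "apistrat")
query
```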
25 Wrapping existing code. Define a generic function to dispatch on the second (data) argument:

doSomething <- function(formula, data, ...){
    UseMethod("doSomething", data)
}

and set the existing function as the default method. The database method loads the data and calls back into the generic:

doSomething.database <- function(formula, database, ...){
    varlist <- paste( all.vars(formula), collapse=", ")
    query <- paste("select", varlist, "from", database$tablename)
    dataframe <- dbGetQuery(database$connection, query)
    ## now actually do Something
    doSomething(formula, dataframe, ...)
}
26 Allowing variables in memory. To allow the function to pick up variables from memory, just restrict the database query to variables that are actually in the database:

dbvars <- names(dbGetQuery(conn, "select * from table limit 1"))
formulavars <- all.vars(formula)
varlist <- paste( intersect(formulavars, dbvars), collapse=", ")

[In practice we would find the list of names in the database first and cache it in an R object.] Now model.frame() will automatically pick up variables in memory, unless they are masked by variables in the database table: the same situation as for data frames.
27 Allowing updates. Three approaches: write new variables into the database with SQL code (needs permission, reference semantics, restricted to SQL syntax); create new variables in memory and save to the database (needs permission, reference semantics, high network traffic); store the expressions and create new variables on data load (wasted effort). The third strategy wins when data loading is the bottleneck and the variable will eventually have to be loaded into R.
28 Design. A database object stores the connection, table name, and new variable information. New variables are created with the update method:

mydata <- update(mydata,
    avgchol = (chol1 + chol2)/2,
    hibp = (systolic>140) | (diastolic>90))

An expression can use variables in the database or previously defined ones, but not simultaneously defined ones. Multiple update()s give a stack of lists of expressions. Use all.vars going down the stack to find which variables to query; return up the stack, making the necessary transformations with eval(). Implemented in the survey and mitools packages.
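A toy version of this design, with an in-memory data frame standing in for the database table; all names are illustrative, not the survey package's internals:

```r
## Store update() expressions unevaluated; on load, walk down the stack
## with all.vars() to find what to fetch, then replay the layers with eval().
makeDB <- function(data) list(data = data, updates = list())

updateDB <- function(db, ...) {
  ## capture the named arguments as unevaluated expressions
  db$updates <- c(db$updates, list(as.list(substitute(list(...)))[-1]))
  db
}

loadVars <- function(db, vars) {
  needed <- vars                        # going down the stack: collect names
  for (layer in rev(db$updates))
    needed <- unique(c(needed, unlist(lapply(layer, all.vars))))
  df <- db$data[intersect(needed, names(db$data))]
  for (layer in db$updates)             # coming back up: evaluate in order
    for (nm in names(layer))
      df[[nm]] <- eval(layer[[nm]], df)
  df[vars]
}

d <- makeDB(data.frame(chol1 = c(200, 180), chol2 = c(190, 210)))
d <- updateDB(d, avgchol = (chol1 + chol2)/2)
d <- updateDB(d, hichol = avgchol > 190)  # uses a previously defined variable
result <- loadVars(d, c("avgchol", "hichol"))
```

Only the columns actually needed by the formula variables and their defining expressions are "queried" from the stored data.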
29 Object structure. survey.design objects contain a data frame of variables and a couple of matrices of survey design metadata. DBIsvydesign objects contain the survey metadata and database information: connection, table name, driver name, database name. All analysis functions are already generic, dispatching on the type of survey metadata (replicate weights vs strata and clusters). Define methods for DBIsvydesign that load the necessary variables into the $variables slot and call NextMethod to do the work. Ideally we would just override model.frame(), but it is already generic on its first argument, probably a mistake in hindsight.
30 Object structure.

> svymean
function (x, design, na.rm = FALSE, ...)
{
    .svycheck(design)
    UseMethod("svymean", design)
}
> survey:::svymean.DBIsvydesign
function (x, design, ...)
{
    design$variables <- getvars(x, design$db$connection,
        design$db$tablename, updates = design$updates)
    NextMethod("svymean", design)
}
31 Incremental algorithms: biglm. Linear regression takes O(np^2 + p^3) time, which can't be reduced easily (for large p you can replace p^3 by p^{log2 7}, but not usefully). The R implementation takes O(np + p^2) memory, but this can be reduced dramatically by constructing the model matrix in chunks. Either compute X^T X and X^T y in chunks and use β̂ = (X^T X)^{-1} X^T y, or compute the incremental QR decomposition of X to get R and Q^T y, and solve Rβ = Q^T y. The second approach is slightly slower but more accurate. Coding it would be a pain, but Alan Miller has done it (Applied Statistics algorithm AS 274). The Fortran subroutine INCLUD in biglm and in leaps updates R and Q^T y by adding one new observation.
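A minimal sketch of the first approach, accumulating X^T X and X^T y over 100-row chunks (simulated data; in practice each chunk would come from a file connection or database query):

```r
## Bounded-memory least squares: only one chunk of X is needed at a time.
set.seed(1)
n <- 1000; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))    # intercept + 2 covariates
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)

xtx <- matrix(0, p, p)
xty <- numeric(p)
for (start in seq(1, n, by = 100)) {
  rows <- start:min(start + 99, n)
  xc <- X[rows, , drop = FALSE]
  xtx <- xtx + crossprod(xc)            # accumulate X^T X
  xty <- xty + crossprod(xc, y[rows])   # accumulate X^T y
}
beta <- solve(xtx, xty)                 # solve the normal equations
```

The QR route used by biglm is more accurate when X is ill-conditioned; for this well-behaved example the normal equations agree with lm() to numerical precision.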
32 Incremental algorithms: biglm. The only new methodology in biglm is an incremental computation for the Huber/White sandwich variance estimator. The middle term in the sandwich is

sum_{i=1}^n x_i (y_i − x_i β̂)^2 x_i^T

and since β̂ is not known in advance this appears to require a second pass through the data. In fact, we can accumulate a (p+1)^2 × (p+1)^2 matrix of products of x and y that is sufficient to compute the sandwich estimator without a second pass.
33 biglm: interface. biglm() uses the traditional formula/data.frame interface to set up a model; update() adds a new chunk of data to the model. The model matrices have to be compatible, so data-dependent terms are not allowed: factors must have their full set of levels specified (not necessarily present in the data chunk), splines must use explicitly specified knots, etc. One subtle issue is checking for linear dependence. This can't be done until the end, because the next chunk of data could fix the problem. The checking is done by the coef() and vcov() methods.
34 bigglm. Generalized linear models require iteration, so the same data has to be used repeatedly. The bigglm() function needs a data source that can be re-read during each iteration. The basic code is in bigglm.function(). It iterates over calls to biglm, and uses a data() function with one argument: data(reset=FALSE) supplies the next chunk of data, or NULL or a zero-row data frame if there is no more; data(reset=TRUE) resets to the start of the data set, for the next iteration.
35 bigglm and SQL. One way to write the data() function is to load data from a SQL query. The query needs to be prepared once, in the same way as for load-on-demand databases.

modelvars <- all.vars(formula)
dots <- as.list(substitute(list(...)))[-1]
dotvars <- unlist(lapply(dots, all.vars))
vars <- unique(c(modelvars, dotvars))
tablerow <- dbGetQuery(data, paste("select * from ", tablename, " limit 1"))
tablevars <- vars[vars %in% names(tablerow)]
query <- paste("select ", paste(tablevars, collapse = ", "),
               " from ", tablename)
36 bigglm and SQL. The query is re-sent when reset=TRUE, and otherwise a chunk of the result set is loaded.

chunk <- function(reset = FALSE) {
    if (reset) {
        if (got > 0) {
            dbClearResult(result)
            result <<- dbSendQuery(data, query)
            got <<- 0
        }
        return(TRUE)
    }
    rval <- fetch(result, n = chunksize)
    got <<- got + nrow(rval)
    if (nrow(rval) == 0)
        return(NULL)
    return(rval)
}
37 bigglm iterations. If p is not too large and the data are reasonably well-behaved, so that the log-likelihood is well-approximated by a quadratic, three iterations should be sufficient, and good starting values will cut this to two iterations or even to one. Good starting values may be available by fitting the model to a subsample. If p is large or the data are sparse, it may be preferable to use a fixed small number of iterations as a regularization technique.
38 Subset selection and biglm. Forward/backward/exhaustive search algorithms for linear models need only the fitted QR decomposition or matrix of sums and products, so forward search is O(np^2 + p^3) and exhaustive search is O(np^2 + 2^p). In particular, the leaps package uses the same AS 274 QR decomposition code and so can easily work with biglm. The subset selection code does not have to fit all the regression models, only work out the residual sum of squares for each model. The user has to fit the desired models later; the package has vcov() and coef() methods.
39 Subset selection and glm. For a generalized linear model it is not possible to compute the log-likelihood without a full model fit, so the subset selection shortcuts do not work. They do work approximately, which is good enough for many purposes: fit the maximal model with bigglm; do subset selection on the QR decomposition object, which represents the last iteration of weighted least squares, returning several models of each size (eg nbest=10); refit the best models with bigglm to compute the true log-likelihood, AIC, BIC, CIC, DIC, ..., MIC (key mouse). Joke (c) Scott Zeger.
40 Random subsets. Even when analyzing a random subset of the data is not enough for final results, it can be a useful step. Fitting a glm to more than √N observations and then taking one iteration from these starting values for the full data is asymptotically equivalent to the MLE. Computing a 1 − ɛ confidence interval for the median on a subsample, followed up by one pass through the data, will be successful with probability 1 − ɛ; otherwise repeat until successful. The same trick works for quantile regression. Averaging an estimator over a lot of random subsamples will be valid and close to efficient for any √n-consistent estimator if the subsamples are large enough for the first-order term in the sampling error to dominate.
41 Median. Take a subset at random of size n ≪ N. Construct a 99.99% confidence interval (l, u) for the median. The interval will cover about a 2/√n fraction of your data. Take one pass through the data, counting how many observations m_l are below l and how many m_u are above u, and loading any that are between l and u. This takes about 2N/√n memory. If m_l < N/2 and m_u < N/2, the median is the (N/2 − m_l)-th element of those loaded into memory, and this happens with probability 99.99%. Otherwise repeat until it works (the average number of trials is about 1.0001).
42 Median. Total memory use is n + 2N/√n, optimized by n = N^{2/3}. With N = 10^6, n = 10^4: ie, a 99% reduction. If the data are already in random order we can use the first n for the random subsample, so a total of only one pass is needed; otherwise we need two passes. This is a potentially large improvement over sorting, which takes lg N − lg n passes to sort N items with working memory that holds n. For quantile regression the same idea works with n = (Np)^{2/3}. See Roger Koenker's book.
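A sketch of the subsample-plus-one-pass median on simulated data (an in-memory vector stands in for the file we would stream through; the interval comes from the binomial distribution of order statistics):

```r
## Subsample, get a wide CI for the median, then one counting pass.
set.seed(2)
N <- 1e5
x <- rnorm(N)                    # the "large" data set
n <- ceiling(N^(2/3))            # subsample size, as above

s <- sort(sample(x, n))
k <- qbinom(c(0.00005, 0.99995), n, 0.5)   # ~99.99% order-statistic CI
l <- s[k[1]]; u <- s[k[2]]

## one pass: count below l and above u, keep only the middle values
m_l  <- sum(x < l)
m_u  <- sum(x > u)
kept <- x[x >= l & x <= u]       # about 2N/sqrt(n) values
stopifnot(m_l < N/2, m_u < N/2)  # else: repeat with a fresh subsample
med <- sort(kept)[N/2 - m_l]
```

Here kept holds only a few percent of x, yet med is the (N/2)-th order statistic of the full data.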
43 Two-phase sampling. For survival analysis and logistic regression, we can do much better than random sampling. When studying a rare event, most of the information comes from the small number of cases. The case-cohort design uses the cases and a stratified random sample of non-cases to achieve nearly full efficiency at much reduced computational effort. It is used in analyzing large national health databases in Scandinavia. Analyse using cch() in the survival package.
44 Outsourcing to SQL. SQL can do addition and multiplication within a record, and sums across records. This means it can construct the design matrix X, and X^T X and X^T y, for a linear regression of y on X. R can then do the last step of matrix inversion and multiplication to compute β̂. This approach is bounded-memory and also bounded data transfer (if the database is on a remote computer). There is no upper limit on data set size except limits on the database server.
45 Model frame and model matrix. The structure of the design matrix does not depend on the data values, and so can be computed by model.frame() and model.matrix() even from a zero-row version of the data. The current implementation allows only contr.treatment contrasts, and interactions are allowed only between factor variables. The model frame and model matrix are created as new tables in the database (views might be better), and a finalizer is registered to drop the tables when the objects are garbage-collected. Subsetting is handled by using zero weight for observations outside the subset. Not the most efficient implementation, but it never modifies tables and has pass-by-value semantics.
46 Model frame and model matrix.

> source("sqlm.R")
> sqlapi <- sqldataset("api.db", table.name="apistrat", weights="pw")
Loading required package: DBI
Warning message:
In is.na(rows) : is.na() applied to non-(list or vector) of type NULL
> s1 <- sqllm(api00~api99+emer+stype, sqlapi)
> s1
  _api99_  _emer_  stypeH  stypeM  _Intercept_
  ...
attr(,"var")
  ...
> sqrt(diag(attr(s1,"var")))
  _api99_  _emer_  stypeH  stypeM  _Intercept_
  ...
47 Model frame and model matrix.

> dbListTables(sqlapi$conn)
[1] "_mf_10d63af1" "_mm_60b7acd9" "apiclus1" "apiclus2" "apistrat"
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells     ...
Vcells     ...
> dbListTables(sqlapi$conn)
[1] "apiclus1" "apiclus2" "apistrat"
48 Other models Chen & Ripley (DSC 2003) described how to use this approach more generally: the iterative optimization for many models can be written as arithmetic on large data sets and parameters to create a small update vector, followed by a small amount of computation to get new parameter values. For example, fitting a glm by IWLS involves repeated linear model fits, with the working weights and working response recomputed after each fit. In a dialect of SQL that has EXP() the working weights and working response can be computed by a select statement.