Querying about the Past, the Present, and the Future in SpatioTemporal Databases


 Randolph Morton
 1 years ago
 Views:
Transcription
1 Querying aout the Past, the Present, and the Future in SpatioTemporal Dataases Jimeng Sun Dimitris Papadias Yufei Tao Bin Liu Department of Computer Science Carnegie Mellon University Pittsurgh, PA, USA Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Hong Kong {dimitris, Department of Computer Science City University of Hong Kong Tat Chee Avenue, Hong Kong Astract Moving ojects (e.g., vehicles in road networks) continuously generate large amounts of spatiotemporal information in the form of data streams. Efficient management of such streams is a challenging goal due to the highly dynamic nature of the data and the need for fast, online computations. In this paper we present a novel approach for approximate query processing aout the present, past, or the future in spatiotemporal dataases. In particular, we first propose an incrementally updateale, multidimensional histogram for presenttime queries. Second, we develop a general architecture for maintaining and querying historical data. Third, we implement a stochastic approach for predicting the results of queries that refer to the future. Finally, we experimentally prove the effectiveness and efficiency of our techniques using a realistic simulation.. Introduction Research on spatiotemporal access methods has mainly focused on two aspects: (i) storage and retrieval of historical information, (ii) future prediction. Several indexes [PJT, KGT+, TP], usually ased on multiversion or 3dimensional variations of Rtrees, have een proposed towards the first goal, aiming at minimization of the storage requirements and query cost. Methods for future prediction assume that, in addition to the current positions, the velocities of moving ojects are known. The goal is to retrieve the ojects that satisfy a spatial condition at a future timestamp (or interval) given their present motion vectors (e.g., "ased on the current information, find the cars that will e in the city center minutes from now"). The only practical index in this category is the TPRtree [SJLL] and its variations [SJ2, TPS3] (also ased on Rtrees). Other, mostly theoretical, predictive indexes have een proposed in [KGT99] (applicale only to D space) and [AAE] (with good asymptotical performance, ut inapplicale in practice due to the large hidden constants).. Motivation Despite the large numer of methods that focus explicitly on historical information retrieval or future prediction, currently there does not exist a single index that can achieve oth goals. Even if such a "universal" structure existed (e.g., a multiversion TPRtree keeping all previous history of each oject), it would e inapplicale for several updateintensive applications, where it is simply infeasile to continuously update the index and at the same time process queries. For instance, an update (i.e., deletion and reinsertion) in a TPRtree may need to access more than nodes, which means that y the time it terminates its result may already e outdated (due to another update of the same oject). Even for a small numer of moving ojects and a low update rate, the TPRtree (or any other index) cannot "follow" the fast changes of the underlying data. Another prolem concerns the space requirements. The incoming data are usually in the form of data streams (e.g., through sensors emedded on the road network), which are potentially unounded in size. Therefore, materializing all data is unrealistic. Furthermore, even if all the data were stored, the size of the tree would render query processing very expensive since any algorithm would have to access at least a complete path from the root to the leaf level. Finally, in several applications, such as traffic control systems, the main focus of query processing is retrieval of approximate summarized information aout ojects that satisfy some spatiotemporal predicate (e.g., " the numer of cars in the city center minutes from now"), as opposed to exact information aout the qualifying ojects (i.e., the car ids), which may e unavailale, or irrelevant..2 Contriutions Motivated y the aove oservations, in this paper we propose a comprehensive method that (i) can approximately answer aggregate spatiotemporal queries aout the past, the present and the future, (ii) can continuously follow the (typically, very frequent) changes in data distriution, (iii) it is efficient and adjustale to the
2 urden of the system, i.e., it does not strain the system, especially during periods of intensive workloads, and (iv) reduces the space consumption y summarizing the data. Our first contriution is an adaptive multidimensional histogram (AMH), which is used for approximate processing of presenttime queries. AMH utilizes idle CPU cycles to perform incremental maintenance. Its precision depends only on the availale system resources, and does not deteriorate with time (meaning the AMH does not need to e periodically reuilt). The second contriution is a general architecture, called the historical synopsis, for maintaining and querying historical data. In this architecture, when a histogram ucket is updated, the previous version is kept in a mainmemory index. If the availale memory is exceeded, part of the index containing the least recent information migrates to the disk. As a result, queries referring to the present and recent past (which are more frequent) can e answered without any I/O. Third we present a stochastic approach, ased on exponential smoothing, that answers queries aout the future using oservations from the present and the recent past. Unlike previous approaches for spatiotemporal prediction, this technique does not assume knowledge of velocity vectors, which in most cases is unavailale. Finally, we experimentally evaluate the proposed techniques under a reallife simulation, where updates and queries are interleaved with histogram maintenance operations. We study the effect of the system workload on the estimation accuracy, and verify the need for approximation methods that adapt to the availale recourses. The rest of the paper is organized as follows. Section 2 reviews previous work related to ours. Section 3 provides the prolem definition and an overview of the proposed methods. Section 4 presents the concrete algorithms of AMH, while Section descries the historical synopsis. Section 6 presents the stochastic approach for prediction, and Section 7 contains the experimental evaluation. Finally, Section 8 concludes the paper with directions for future work. 2. Related Work Because the proposed techniques comine multiple goals, they relate to a numer of different research topics. In particular, since the core of our method is AMH, in Section 2. we survey related work on multidimensional histograms. Section 2.2 presents methods for updating such histograms using query feedack. Section 2.3 discusses other multidimensional approximation techniques. Section 2.4 overviews spatiotemporal prediction techniques and their limitations in several applications. Finally, Section 2. presents methods for (exact) retrieval of spatiotemporal aggregation information. 2. Static multidimensional histograms Histogram construction techniques split the data space into uckets, usually ased on the assumption that data within a small region are (almost) uniform. The various methods differ on the partition policy. The equidepth histogram of [MD88] first partitions along one dimension so that each ucket has the same numer (i.e., frequency) of ojects. Then, these uckets are partitioned again along another dimension. Mhist [PI97] splits on the most critical dimension according to a partitioning metric (e.g., variance). Minskew [APR99] partitions the space into a regular grid. Neighoring cells are then grouped into rectangular, nonoverlapping uckets y a greedy algorithm that tries to minimize the spatialskew (i.e., the variance of the cell density in each ucket). Genhist [GKTD] allows overlapping uckets. As with Minskew, it starts with a regular grid and creates uckets for the cells with high density. Then it removes a percentage of data from these cells to make the area around them smoother. Finally it decreases the resolution and repeats the process recursively. The SQ [AN] and Euler histograms [SAA2], in addition to locations, take into account ojects extents, and are more accurate for nonpoint data. All the aove histograms apply to spatial or multidimensional dataases, where the data are (almost) static. When the data distriution changes due to updates, the ucket structure may ecome outdated, and the histogram must e reuilt. In several spatiotemporal applications, however, there is simply not enough time to scan the entire data (possily several times) in order to uild the histogram from scratch. Furthermore, the new histogram may already e outdated, due to the changes that occurred during its reuilding. 2.2 Queryadaptive multidimensional histograms Query adaptive histograms avoid reuilding y using query feedack to refine the uckets. STGrid [AC99], ased on the heuristic that uckets with higher frequencies contriute more to the estimation error, splits uckets with high frequencies and merges uckets with low frequencies. Restructuring is performed at certain intervals (a parameter of the histogram) for entire rows or columns, selected y a split or a merge threshold (also a parameter). Its structure, however, cannot accurately capture aritrary distriutions. STHoles [BGC] alleviates this prolem y allowing nesting of uckets. If a query identifies large frequency variance within a ucket, a "hole", i.e., a new ucket, is created inside the original one. This "drilling" operation replaces the splitting mechanism of STGrid. SASH [LWV3] decomposes the multidimensional space into suspaces of lower dimensionality. For each suspace a separate histogram is uilt and maintained using query feedack mechanisms.
3 Queryadaptive histograms are ased on the assumption that the frequency of queries is much higher than the frequency of updates. Thus, after the data distriution changes, there is a sufficient numer of queries to "train" (i.e., refine) the histogram. Although the accuracy for these queries will e low, susequent queries will e estimated using the refined histogram, which gradually ecomes very accurate. Clearly, this assumption does not hold for updateintensive applications, where the numer of queries for each instance of the data is, typically, very small. 2.3 Other multidimensional approximation methods Several multidimensional methods emed the data into another domain and summarize the emedded data in order to provide a compact data representation. The DCTased histogram [LKC99] maintains a large numer of small uckets using discrete cosine transform, which cannot e incrementally maintained. The histogram of [MVW98] extracts the most descriptive wavelet coefficients. Although it can e incrementally maintained [MVW], its performance degrades with the numer of updates, so that eventually it has to e reuilt. Another technique [TGIK2] compresses the data distriution into a multidimensional sketch; when the query arrives, the histogram is extracted from the sketch. The main drawack of this approach is that the extracting process is expensive and, therefore, not suitale for online queries. [GMP2] incrementally maintains a small sample set (acking sample) of underlying data and histograms are then constructed/maintained y the acking sample. However, it requires scanning the entire data to regenerate the acking sample when the distriution changes, which is not possile in our case. Furthermore, some other applications of samplingased techniques [GM98, WAA] are also questionale, since it is not clear how to select and effectively maintain a representative multidimensional sample in the presence of very intensive updates. Finally, all previous methods capture a snapshot of the data and cannot e used for historical information retrieval. 2.4 Spatiotemporal prediction methods Methods for spatiotemporal prediction estimate the numer of ojects that will satisfy some spatial condition during a future interval, ased on the current location and velocity information. Choi and Chung [CC2] and Tao et al. [TSP3] present proailistic models for uniform data, which are then applied to nonuniform distriutions using some conventional multidimensional histogram. An experimental comparison of several techniques can e found in [HKT3]. All existing methods assume linear movement and that the velocities of all ojects are known (the same assumptions that hold for the TPRtreeased indexes [SJLL, TPS3]). This restricts their applicaility in several applications ecause: (i) movement is not always linear, (ii) even if it is linear, it is not usually known (e.g., moile devices transmit only their location to the ase station), and (iii) even if it is linear and known, it changes so fast that prediction using velocity information is meaningless (e.g., cars moving on road networks). 2. Spatiotemporal aggregation methods Several methods (e.g., [PTKZ2, ZTG2]) focus on spatiotemporal aggregation. Although the target queries are similar to our historical queries, the context is different. In particular, they assume a "conventional" processing framework where every query invokes disk I/Os and returns an exact answer. Here we sacrifice accuracy for efficiency and attempt to answer the most frequent queries using only the main memory. 3. Prolem Definition and System Overview We assume a set of ojects moving in the twodimensional space and a central server collecting information aout their locations over time. Update data received y the server contain the previous and the new location of individual ojects, ut not their velocity or id. Moving ojects may issue updates in periodic intervals or ased on the deviation from the previous transmitted position. Static ojects do not send information to the server; until an oject transmits a new location it is assumed to e in the last recorded position. We adopt a 2D grid that partitions the data space into w w (where w is a constant called the resolution) regular cells with width /w on each axis. Each cell c ( c w 2 ) is associated with a frequency F c, which is the numer of ojects (at the present time) in its extent. The frequencies of all the cells are maintained uptodate in memory, and serve as the data of the highest granularity. The resolution is determined y the numer of moving ojects in the system, as well as the desirale tradeoff etween accuracy and efficiency. A spatiotemporal aggregation query q(q R,q T ) retrieves the numer of ojects that fall in a rectangle region q R at a timestamp q T. If (i) q T = (i.e., the present time), q constitutes a present timestamp (PT) query; (ii) q T <, q is a historical timestamp (HT) query; and (iii) q T >, q is a future timestamp (FT) query. When the resolution of the grid is high, exhaustive examination of all the cells that are relevant to a query is expensive. AMH (adaptive multidimensional histogram) remedies this prolem y grouping adjacent cells with similar frequencies into a small numer of (rectangular) uckets (similar to other histograms such as minskew [APR99] and genhist [GKTD]). Bucket information is updated as new tuples arrive, and ucket extents (i.e., grouping of cells) continuously adapt to the data distriution changes through incremental maintenance performed during the
4 idle CPU time (i.e., when there are no incoming data or queries). A inary partition tree accelerates ucket maintenance and processing of PT queries. To support historical retrieval, uckets that are outdated (due to data or extent changes) are inserted in a mainmemory index (T) optimized for HT queries. Each ucket is associated with a lifespan denoting the temporal interval during which the ucket information is valid. When the size of the index exceeds the availale memory, part of the tree containing the least recent uckets is migrated to the disk, such that queries aout the recent past can e answered using the memoryresident portion of the index. This migration occurs in locks (rather than individual uckets) at a time, and incurs minimal I/O overhead. AMH together with index T constitute the historical synopsis, whose general structure is shown in Figure 3.. Figure 3.: Historical synopsis The historical synopsis is directly used for answering PT and HT queries. On the other hand, FT queries are processed y an exponential smoothing technique, which retrieves information aout the present and the recent past, in order to predict the future. Figure 3.2 shows the astract query processing mechanism and Tale 3. lists the frequentlyused symols. In the rest of the paper we descrie the various components of the architecture, starting with AMH in Section Adaptive MultiDimensional Histogram Given a regular w w cell partition, AMH generates n rectangular uckets whose edges are aligned with the cell oundaries. For each ucket k ( k n), we denote (i) as n k the numer of cells it covers, (ii) as f k the average frequency of these cells (i.e., f k =(/n k )Σ ( cell in k) F c ), and (iii) as v k their variance (i.e., v k =(/n k )Σ ( cell in k) (F c f k ) 2 ). AMH aims at minimizing the weighted variance sum (WVS) of all the uckets, or formally: WVS= i=~n (n i v i ) (4) 4. AMH structure Each ucket stores: (i) its rectangular extent R (ausing the terminology slightly, we also use R to denote its area), (ii) the average frequency f of all the cells in R, and (iii) the average of squared frequency g of these cells: g=(/n k ) ( cell c in R) (F c ) 2. Thus, the numer n of cells covered y a ucket can e represented as R w 2, where w is the cell resolution. A ucket s variance v equals v=g f 2, and as a result, WVS can e computed using R, f, g of all uckets. AMH maintains a inary partition tree (BPT), where each leaf node corresponds to a ucket. An intermediate node is associated with a rectangular extent R that encloses the extents of its (two) children. Initially, BPT contains a single leaf node, and the histogram has one ucket covering the entire data space. New uckets (leaf nodes) are created through ucket splits (elaorated shortly), ut the total numer n of uckets never exceeds a system parameter B. As an example, Figure 4.a shows the frequencies of 2 cells (i.e., w=) at timestamp, Figure 4. demonstrates the extents of 6 uckets, and Figure 4.c illustrates the corresponding BPT. y y 4 y 3 y 2 y y y 4 y 3 y 2 y Symol w F c (B) n R k n k f k g k v k Figure 3.2: Query processing Description resolution frequency of cell c (maximum) numer of uckets ucket extent or area numer of cells in ucket k frequency (i.e., numer of ojects) in ucket k average squared frequency of ucket k frequency variance of ucket k Tale 3.: Frequent symols x x 2 x 3 x 4 x x x 2 x 3 x 4 x (a) Cell frequency () Bucket extents node R f g v 7 [ x x ][ y y ] /36 [ 8/ /6 9/6 x x 2 ][ y y 2 ] 2/4 2/ [ x 3 x 4 ][ y 4 y ] 3/4 43/4 3/6 4 [ x 3 x 4 ][ y 2 y 3 ] 39/4 383/4 / [ x x ][ y 2 y ] 2/4 /4 3/6 WVS=7.8 [ x 3 x ][ y y ] 3/3 3/3 3 4 (c) The BPT (d) Bucket information Figure 4.: AMH at time
5 Intermediate node 7, for instance, covers uckets, 2, implying that they were created y splitting 7. Figure 4.d lists the detailed information of each ucket (v is not explicitly stored, ut computed from f and g). 4.2 AMH information update A location update contains the old and the new position of a moving oject. When such an update is received, AMH executes an infoupdate process to modify the frequencies of the cells and uckets that cover the corresponding locations. Particularly, a cell can e identified using a hash function, while a ucket is located y following a single path of BPT, guided y the extents of the intermediate nodes. Assume, for example, that the frequency of cell (x,y 2 ) in Figure 4.a changes from 6 to at time 2, due to concurrent movement of several ojects. After modifying the cell frequency, infoupdate otains the frequency (squared frequency) difference 4 (64) etween the new and old values. To complete the update, it identifies ucket 6 covering the cell (following the path,, 9 ), and updates its f and g as f 6 = (f 6 n 6 +4)/n 6, g 6 = (g 6 n 6 +64)/n 6, where n 6 is the numer of cells covered y 6, and f 6 (g 6 ) is their average (square) frequency. Assuming that BPT is fairly alanced, the algorithm (shown in Figure 4.2) has average cost logb, where B is the maximum numer of uckets. Algorithm Infoupdate (x, y, d) /* (x,y) are the cell coordinates, d is the frequency difference */. find the cell c that contains (x, y) using a hash function 2. f=d ; g= (F c +d) 2  F c 2 ; F c =F c +d 3. descend BPT to find the ucket that contains (x,y) 4. f =(R w 2 f + f)/(r w 2 ); g =(R w 2 g + g)/(r w 2 ) // R is the extent area of, w is the cell resolution End Infoupdate Figure 4.2: infoupdate algorithm Updates may alter the frequency distriution, increasing the frequency variances and the approximation error. Figure 4.3a shows the updated cells at time 2, and Figure 4.3 illustrates the information of the corresponding uckets. Notice that v, v are significantly larger than those in Figure 4.d, causing considerale increase of WVS (from 7.8 in Figure 4.d to 73.8). To remedy this prolem, AMH performs ucket reorganization when the CPU is idle (i.e., no data/query is pending), which involves ucket merging and splitting. 4.3 AMH ucket merge A merge is performed when the numer of uckets has reached the maximum value B (assumed to e 6 in these examples), and aims at grouping cells of multiple uckets, whose frequencies have converged since the last reorganization, into a single ucket. Towards this, AMH selects the leaf node of BPT with the lowest merge penalty (MPen); MPen is defined as the increase of WVS (see equation 4) if the node is merged with its siling. Assume, for instance, the cell frequencies in Figure 4.3a, the ucket information in Figure 4.3, and the tree in Figure 4.c. Merging ucket 3 with its siling 4 removes oth nodes from BPT and makes 8 a leaf (i.e., a new ucket in the histogram). The change incurs n 8 v 8 (n 3 v 3 +n 4 v 4 ) increase in WVS (i.e., the merge penalty for 3, 4 ), where n i is the numer of cells covered y node i, and v i is its frequency variance. A ucket s siling can sometimes e an intermediate node, e.g., the siling of ucket 6 is 8, in which case MPen should e computed as n 9 v 9 (n 3 v 3 +n 4 v 4 +n 6 v 6 ). Notice that, merging 6 with 8 would reduce the ucket numer y 2 (i.e., removing 3 leaf nodes 3, 4, 6, spawning 9 ). The leaf node with the lowest merge penalty can e identified using a single postorder traversal of BPT s intermediate nodes (i.e., each intermediate node is processed after its children). Consider, for example, Figure 4.c, where node 7 is the first node processed. We compute the following information (from its children, 2 ): (i) the average frequency of all cells in its extent f 7 = (f R + f 2 R 2 )/ R 7, (ii) the average squared frequency g 7 = (g R + g 2 R 2 )/R 7, (iii) the merge penalty of its children MPen 7 = n 7 v 7 (n v +n 2 v 2 ), where n i =R i w 2, v i =g i f 2 i. Next, similar computations are performed for 8, 9,, (in this order). Note that i) the leaf node 3 ( 4, 6 ) is updated from the corresponding cells; ii) the computation for 9 ( ) is ased on the alreadycomputed information of intermediate node 8 ( 9 ). Similarly, the information of is computed y 7 and. Finally, the algorithm merges the children of the node that incurs the smallest penalty (in this case 9 ). Figures 4.4a and 4.4 illustrate the resulting ucket extents and BPT, respectively. Figure 4. shows the pseudocode of the ucketmerge algorithm. Its running time is B, since it visits all nodes of BPT. y y 4 y 3 y 2 y 9 node R f g v 3 9 [ x x 2 ][ y 3 y ] 32/6 284/6 68/ [ x x 2 ][ y y 2 ] 2/4 2/4 8/6 9 3 [ x 3 x 4 ][ y 4 y ] 39/4 38/4 3/ [ x 3 x 4 ][ y 2 y 3 ] 39/4 383/4 /6 2 [ x 3 x ][ y y ] 2/3 2/3 62/9 [ 6 6 x x ][ y 2 y ] 4/4 42/4 3/6 x x 2 x 3 x 4 x WVS=73.8 (a) Cell frequency () Bucket information Figure 4.3: AMH at time 2 y 4 9 y 3 7 y 2 y x x 2 x 3 x 4 x (a) Buckets after merging () BPT after merging Figure 4.4: Example of ucket merge (cont. Figure 4.)
6 Algorithm BucketMerge. minmpen= ; newleaf=null 2. let =the first intermediate node in the postorder traversal 3. do 4. if ( has at least one leaf node). compute f, g, MPen from its children 6. if (MPen <minmpen) 7. minmpen=mpen ; newleaf=; 8. =next intermediate node in the postorder traversal 9. until (=root). merge the children of newleaf End BucketMerge Figure 4.: ucketmerge algorithm 4.4 AMH ucket split A ucket split creates refined partitions in regions where uckets have large variance. It improves AMH s approximation accuracy y (i) reducing the ucket frequency variance, and (ii) decreasing the ucket extent. Towards this, the algorithm splits the ucket with the highest split enefit (SBen), i.e., the maximum decrease of WVS achieved y splitting this ucket at the est position. Consider, for example, ucket in Figure 4.4a. Since a split guarantees that the resulting uckets cover an integer numer of cells, there are three possile ways to split (one on the x, and two on the yaxes, respectively). The largest reduction of WVS is achieved y splitting on the xdimension, resulting in uckets 2, 3 (Figure 4.6a) and split enefit n v (n 2 v 2 +n 3 v 3 ). The split enefits of the other uckets are also computed and the one with the largest SBen (in this case ) is split (at its est split position). The ucketsplit algorithm repeats the split process until (i) new data/queries arrive, (ii) the numer of uckets reaches the maximum threshold B or (iii) there does not exist any ucket with positive split enefit, implying that the current ucket numer is sufficient to model the cell frequency distriution. Continuing the example, ucketsplit computes SBen for the new uckets ( 2, 3 ) and splits (whose enefit is currently the largest) into 4,. Since the ucket numer has reached B (=6), the algorithm terminates. Figures 4.6a, 4.6 illustrate the final ucket extents and BPT, respectively. y y 4 y 3 y 2 y PT query q x x 2 x 3 x 4 x (a) Final Buckets () BPT after splitting Figure 4.6: After splitting, (cont. Figure 4.4) Figure 4.7 summarizes the ucketsplit algorithm. Each ucket reorganization involves at most one ucket merging (i.e., if the numer of uckets equals B), and a small numer of ucket splits. After its termination, the process starts again if the CPU is still idle. Compared to traditional histograms requiring expensive total reuilding, AMH deploys repetitive, simple, interruptile, partial reorganizations, and, therefore, it is ideal for dynamic spatiotemporal environments, where updates/queries may occur in (unpredictale) ursts. Algorithm BucketSplit. for every ucket 2. compute its split enefit and est split position 3. sort all split enefits in descending order into a list L s 4. while (ucket numer n <B and no new data/query and the first ucket in L s has positive split enefit). remove the first ucket from L s 6. split it into, 2 7. otain split enefits and est split positions for, 2 8. insert, 2 into the appropriate positions in L s 9. return End BucketSplit Figure 4.7: ucketsplit algorithm 4. AMH PT query processing A presenttime query is answered y examining all uckets whose extents intersect the query region q R. Such uckets are located y traversing BPT and following the nodes that overlap q R. For example, the query q in Figure 4.6a intersects only ucket 9, which can e reached y following the intermediate nodes,. For each intersecting ucket k, we compute the overlap area R intr etween its extent and q R, and otain a partial result f k R intr w 2 (note that R intr w 2 equals the numer of cells in k that are covered y q R ). Then, the final result is computed as the sum of the partial results of all intersecting uckets, without accessing the underlying cells.. Historical Synopsis In this section, we discuss HT query processing using the historical synopsis that consists of (i) a dynamic histogram maintaining the currently valid uckets (e.g., AMH), and (ii) a past index T storing the osolete uckets.. General architecture A ucket in AMH dies if (i) the frequency in any of its cells is updated, or (ii) its extent changes due to a split or merge. In case (i), a new ucket with the same extent, ut different frequency statistics (i.e., f and g) is created. In case (ii) new ucket(s) are created due to merge or split. Each ucket contains a lifespan [l s, l e ), where l s (l e ) is the time of creation (death). All the uckets in AMH are alive, and their l e equals now. Dead uckets are inserted into index T and their cell content is discarded. In Figure 4., for example, all the uckets o,...,o 6 are alive and have lifespans [,now). Then, in Figures 4.3 uckets, 3,, 6 die at time 2 ecause some cells in their extents change frequencies at this timestamp. This
7 produces four dead uckets with lifespans [,2) (which are inserted into T), and creates four live ones with lifespans [2,now). At the next timestamp 3, uckets, 3, 4,, 6 die (yielding four uckets with lifespans [2,3], and one 4 with lifespan [,3)) since 3, 4, 6 are merged (Figure 4.4) and, are split (Figure 4.6). Buckets 9, 2, 3, 4, that result from these operations are alive with lifespans [3, now). As shown in Figure., after these operations, T contains 9 dead uckets and AMH 6 live ones. dead stored in T alive stored in AMS node R f g v lifespan [ x x 2 ][ y 3 y ] 7/6 9/6 /36 [,2) 3 [ x 3 x 4 ][ y 4 y ] 3/4 43/4 3/6 [,2) [ x 3 x ][ y y ] 3/3 3/3 [,2) 6 [ x x ][ y 2 y ] 2/4 /4 3/6 [,2) 3 [ x 3 x 4 ][ y 4 y ] 39/4 38/4 3/6 [2,3) 4 [ x 3 x 4 ][ y 2 y 3 ] 39/4 383/4 /6 [,3) 6 [ x x ][ y 2 y ] 4/4 42/4 3/6 [2,3) [ x x 2 ][ y 3 y ] 32/6 284/6 68/36 [2,3) [ x 3 x ][ y y ] 2/3 2/3 62/9 [2,3) 2 [ x x 2 ][ y y 2 ] 2/4 2/4 8/6 [,now) * o 9 [ x 3 x ][ y 3 y ] /2 [3,now) 2 [ x x ][ y 3 y ] 3/3 3/3 [3,now) x 3 [ 2 x 2 ][ y 3 y ] 29/3 28/3 2/3 [3,now) [ x [3,now) 4 3 x 4 ][ y y ] 2/2 2/2 x [ x ][ y y ] / / [3,now) Figure.: Buckets at time 3 (cf. Figures 4., 4.3, 4.6) The size of T grows continuously with time and eventually exceeds the availale memory, in which case part of it migrates to the disk. The migration is performed in locks, meaning that we simply move pages of equal sizes from the memory to the disk. Further, the disk accesses are sequential except for, possily, the first page transferred. Accordingly, pointers to these pages are modified to the corresponding disk addresses (from their original mainmemory addresses). Each page of T that is currently in memory is assigned a migration rank, such that pages are migrated in ascending order of their ranks. A separate mainmemory structure (e.g., priority queue) can e maintained to select the page with the lowest rank efficiently. In our implementation, the migration rank of a page is set to the latest ending timestamp l e of all its entries lifespans. In general, pages with low ranks should contain old uckets, and e less likely accessed y queries. Finally, pages that are already on the disk may e eventually (physically) erased or moved to tertiary storage, when they ecome completely osolete..2 Implementation using a packed Btree The past index T should support the following operations efficiently: (i) insertion of a dead ucket, and (ii) selection of uckets intersecting a HT query. Our first implementation adopts a packed Btree, which indexes the ending time l e of the uckets lifespans. Specifically, each leaf entry stores all information of a ucket (i.e., R, f, g, and its life span), while an intermediate entry stores a lifespan [l s, l e ), which encloses those of the leaf entries in * * its child node. All the nodes in the Btree are full, except for the rightmost one at each level, which does not have a minimum utilization requirement. Particularly, we refer to the rightmost leaf node as the active leaf, for which a special pointer is maintained. Since uckets are inserted into T in ascending order of their l e, each insertion simply appends the new ucket into the active leaf, and terminates if the leaf incurs no overflow. Otherwise, a new active leaf is created with the most recently inserted ucket, and this change propagates to the parent levels. The amortized insertion cost is constant. Figure.2 illustrates the corresponding packed Btree for the dead uckets in Figure., assuming a capacity of 3 entries per node. D [,2) [,3) [2,3) [,2) 3 [,2) [,2) 6 [,2) 3 [2,3) 4 [,3) 6[2,3) [2,3) [2,3) A B C lifespan active leaf Figure.2: The packed Btree In some cases, HT may have to access AMH, if there are some uckets that were created at the query timestamp, ut still remain alive. Consider, for example, the query with region q R =[x 2,x 3 ][y 2,y 3 ] and timestamp q T = on the uckets of Figure. (the corresponding AMH is shown in Figure 4.6). The query must visit all the uckets whose extents (lifespans) intersect q R at time, namely, those marked with * in Figure.. The final result is the sum of the partial results of all such uckets. The partial result from ucket 2 (which is still alive in AMH), is otained y a PT query as discussed in Section 4.. To locate the intersecting uckets (i.e., and 4 ) in the Btree, the algorithm accesses the nodes whose lifespans contain q T (=), i.e., nodes D, A, B in Figure.2. At the leaf level the partial results of qualifying uckets are otained in the same way as PT queries (i.e., y computing the size of the intersection area with q R ). The packed Btree implementation contains a simple and very efficient insertion algorithm, ut processes HT queries using only temporal pruning. In the next section, we descrie an alternative structure that takes advantage of spatial conditions to accelerate query processing..3 Implementation using a 3D Rtree Our second implementation is ased on a mainmemory adaptation of the 3D R*tree [BKSS9]. In this structure, each intermediate entry contains a 3D ox which encloses the extents and lifespans of all the uckets in its sutree. Given a HT q (which can also e regarded as a 2D rectangle defined y q R at time q T ), we only need to visit those nodes whose 3D oxes intersect that of q. Compared to the packed Btree, the 3D Rtree implementation achieves lower query cost since it utilizes oth spatial and temporal conditions to prune the search space. On the other hand, it incurs higher update overhead
Extracting k Most Important Groups from Data Efficiently
Extracting k Most Important Groups from Data Efficiently Man Lung Yiu a, Nikos Mamoulis b, Vagelis Hristidis c a Department of Computer Science, Aalborg University, DK9220 Aalborg, Denmark b Department
More informationEfficient Processing of Joins on Setvalued Attributes
Efficient Processing of Joins on Setvalued Attributes Nikos Mamoulis Department of Computer Science and Information Systems University of Hong Kong Pokfulam Road Hong Kong nikos@csis.hku.hk Abstract Objectoriented
More informationApproximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Fan Deng University of Alberta fandeng@cs.ualberta.ca Davood Rafiei University of Alberta drafiei@cs.ualberta.ca ABSTRACT
More informationEstimating Query Result Sizes for Proxy Caching in Scientific Database Federations
Estimating Query Result Sizes for Proxy Caching in Scientific Database Federations Tanu Malik, Randal Burns Dept. of Computer Science Johns opkins University Baltimore, MD 21218 Nitesh V. Chawla Dept.
More informationApproximate Frequency Counts over Data Streams
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku Stanford University manku@cs.stanford.edu Rajeev Motwani Stanford University rajeev@cs.stanford.edu Abstract We present algorithms for
More informationCostAware Strategies for Query Result Caching in Web Search Engines
CostAware Strategies for Query Result Caching in Web Search Engines RIFAT OZCAN, ISMAIL SENGOR ALTINGOVDE, and ÖZGÜR ULUSOY, Bilkent University Search engines and largescale IR systems need to cache
More informationBitmap Index Design Choices and Their Performance Implications
Bitmap Index Design Choices and Their Performance Implications Elizabeth O Neil and Patrick O Neil University of Massachusetts at Boston Kesheng Wu Lawrence Berkeley National Laboratory {eoneil, poneil}@cs.umb.edu
More informationSizeConstrained Weighted Set Cover
SizeConstrained Weighted Set Cover Lukasz Golab 1, Flip Korn 2, Feng Li 3, arna Saha 4 and Divesh Srivastava 5 1 University of Waterloo, Canada, lgolab@uwaterloo.ca 2 Google Research, flip@google.com
More informationLow Overhead Concurrency Control for Partitioned Main Memory Databases
Low Overhead Concurrency Control for Partitioned Main Memory bases Evan P. C. Jones MIT CSAIL Cambridge, MA, USA evanj@csail.mit.edu Daniel J. Abadi Yale University New Haven, CT, USA dna@cs.yale.edu Samuel
More informationA Googlelike Model of Road Network Dynamics and its Application to Regulation and Control
A Googlelike Model of Road Network Dynamics and its Application to Regulation and Control Emanuele Crisostomi, Steve Kirkland, Robert Shorten August, 2010 Abstract Inspired by the ability of Markov chains
More informationStaring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores
Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores Xiangyao Yu MIT CSAIL yxy@csail.mit.edu George Bezerra MIT CSAIL gbezerra@csail.mit.edu Andrew Pavlo Srinivas Devadas
More informationStaring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores
Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores Xiangyao Yu MIT CSAIL yxy@csail.mit.edu George Bezerra MIT CSAIL gbezerra@csail.mit.edu Andrew Pavlo Srinivas Devadas
More informationRobust Set Reconciliation
Robust Set Reconciliation Di Chen 1 Christian Konrad 2 Ke Yi 1 Wei Yu 3 Qin Zhang 4 1 Hong Kong University of Science and Technology, Hong Kong, China 2 Reykjavik University, Reykjavik, Iceland 3 Aarhus
More informationLIRS: An Efficient Low Interreference Recency Set Replacement Policy to Improve Buffer Cache Performance
: An Efficient Low Interreference ecency Set eplacement Policy to Improve Buffer Cache Performance Song Jiang Department of Computer Science College of William and Mary Williamsburg, VA 231878795 sjiang@cs.wm.edu
More informationContinuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream
Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream Xuemin Lin University of New South Wales Sydney, Australia Jian Xu University of New South Wales Sydney, Australia
More informationData Mining on Empty Result Queries
.............................. Dr. Nysret Musliu M.Sc. Arbeit Data Mining on Empty Result Queries ausgeführt am Institut für Informationssysteme Abteilung für Datenbanken und Artificial Intelligence der
More informationAdaptive Join Operators for Result Rate Optimization on Streaming Inputs
1 Adaptive Join Operators for Result Rate Optimization on Streaming Inputs Mihaela A. Bornea, Vasilis Vassalos, Yannis Kotidis #1, Antonios Deligiannakis 2 # Department of Informatics, Department of Electronic
More informationDensityBased Clustering over an Evolving Data Stream with Noise
DensityBased Clustering over an Evolving Data Stream with Noise Feng Cao Martin Ester Weining Qian Aoying Zhou Abstract Clustering is an important task in mining evolving data streams. Beside the limited
More informationWhat s the Difference? Efficient Set Reconciliation without Prior Context
hat s the Difference? Efficient Set Reconciliation without Prior Context David Eppstein Michael T. Goodrich Frank Uyeda George arghese,3 U.C. Irvine U.C. San Diego 3 ahoo! Research ABSTRACT e describe
More informationWTF: The Who to Follow Service at Twitter
WTF: The Who to Follow Service at Twitter Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh Twitter, Inc. @pankaj @ashishgoel @lintool @aneeshs @dongwang218 @reza_zadeh ABSTRACT
More informationDealing with Uncertainty in Operational Transport Planning
Dealing with Uncertainty in Operational Transport Planning Jonne Zutt, Arjan van Gemund, Mathijs de Weerdt, and Cees Witteveen Abstract An important problem in transportation is how to ensure efficient
More informationReal Time Personalized Search on Social Networks
Real Time Personalized Search on Social Networks Yuchen Li #, Zhifeng Bao, Guoliang Li +, KianLee Tan # # National University of Singapore University of Tasmania + Tsinghua University {liyuchen, tankl}@comp.nus.edu.sg
More informationOutOfCore Algorithms for Scientific Visualization and Computer Graphics
OutOfCore Algorithms for Scientific Visualization and Computer Graphics Course Notes for IEEE Visualization 2002 Boston, Massachusetts October 2002 Organizer: Cláudio T. Silva Oregon Health & Science
More informationEffective Computation of Biased Quantiles over Data Streams
Effective Computation of Biased Quantiles over Data Streams Graham Cormode Rutgers University graham@dimacs.rutgers.edu S. Muthukrishnan Rutgers University muthu@cs.rutgers.edu Flip Korn AT&T Labs Research
More informationRevisiting the Edge of Chaos: Evolving Cellular Automata to Perform Computations
Revisiting the Edge of Chaos: Evolving Cellular Automata to Perform Computations Melanie Mitchell 1, Peter T. Hraber 1, and James P. Crutchfield 2 In Complex Systems, 7:8913, 1993 Abstract We present
More informationBenchmarking Cloud Serving Systems with YCSB
Benchmarking Cloud Serving Systems with YCSB Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears Yahoo! Research Santa Clara, CA, USA {cooperb,silberst,etam,ramakris,sears}@yahooinc.com
More informationCumulus: Filesystem Backup to the Cloud
Cumulus: Filesystem Backup to the Cloud Michael Vrable, Stefan Savage, and Geoffrey M. Voelker Department of Computer Science and Engineering University of California, San Diego Abstract In this paper
More informationA Scalable ContentAddressable Network
A Scalable ContentAddressable Network Sylvia Ratnasamy 1 2 Paul Francis 2 Mark Handley 2 Richard Karp 1 2 Scott Shenker 2 1 Dept. of Electrical Eng. & Comp. Sci. 2 ACIRI University of California, Berkeley
More informationIEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2013. ACCEPTED FOR PUBLICATION 1
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2013. ACCEPTED FOR PUBLICATION 1 ActiveSet Newton Algorithm for Overcomplete NonNegative Representations of Audio Tuomas Virtanen, Member,
More informationACMS: The Akamai Configuration Management System
ACMS: The Akamai Configuration Management System Alex Sherman, Philip A. Lisiecki, Andy Berkheimer, and Joel Wein. Akamai Technologies, Inc. Columbia University Polytechnic University. {andyb,lisiecki,asherman,jwein}@akamai.com
More information