Use of Numerical Models as Data Proxies for Approximate Ad-Hoc Query Processing

What is the cost of storing and retrieving data?
What is the cost of buldng the model per node?
What is the cost of buldng the model?

Transcription

1 Preprnt UCRL-JC-?????? Use of Numercal Models as Data Proxes for Approxmate Ad-Hoc Query Processng R. Kammura, G. Abdulla, C. Baldwn, T. Crtchlow, B. Lee, I. Lozares, R. Musck, and N. Tang U.S. Department of Energy Lawrence Lvermore Natonal Laboratory Ths artcle appears n the Proceedngs of the 7 th Jont Conference on Informaton Systems, Research Trangle Park, NC, September Approved for publc release; further dssemnaton unlmted

2 DISCLAIMER Ths document was prepared as an account of work sponsored by an agency of the Unted States Government. Nether the Unted States Government nor the Unversty of Calforna nor any of ther employees, makes any warranty, express or mpled, or assumes any legal lablty or responsblty for the accuracy, completeness, or usefulness of any nformaton, apparatus, product, or process dsclosed, or represents that ts use would not nfrnge prvately owned rghts. Reference heren to any specfc commercal product, process, or servce by trade name, trademark, manufacturer, or otherwse, does not necessarly consttute or mply ts endorsement, recommendaton, or favorng by the Unted States Government or the Unversty of Calforna. The vews and opnons of authors expressed heren do not necessarly state or reflect those of the Unted States Government or the Unversty of Calforna, and shall not be used for advertsng or product endorsement purposes. Ths s a preprnt of a paper ntended for publcaton n a journal or proceedngs. Snce changes may be made before publcaton, ths preprnt s made avalable wth the understandng that t wll not be cted or reproduced wthout the permsson of the author. Ths report has been reproduced drectly from the best avalable copy. Avalable electroncally at Avalable for a processng fee to U.S. Department of Energy And ts contractors n paper from U.S. Department of Energy Offce of Scentfc and Techncal Informaton P.O. Box 62 Oak Rdge, TN Telephone: (865) Facsmle: (865) E-mal: reports@adons.ost.gov Avalable for the sale to the publc from U.S. Department of Commerce Natonal Techncal Informaton Servce 5285 Port Royal Road Sprngfeld, VA Telephone: (800) Facsmle: (703) E-mal: orders@nts.fedworld.gov Onlne orderng: OR Lawrence Lvermore Natonal Laboratory Techncal Informaton Department s Dgtal Lbrary

3 Use of Numercal Models as Data Proxes for Approxmate Ad-Hoc Query Processng R. Kammura 1, G. Abdulla 1, C. Baldwn 1, T. Crtchlow 1, B. Lee 2, I. Lozares 1, R. Musck 3, N. Tang 1 1 CASC, Lawrence Lvermore Natonal Laboratory, CA 2 Department of Computer Scence, Unversty of Vermont, VT 3 Ikun, Inc., CA Abstract As datasets grow beyond the ggabyte scale, there s an ncreasng demand to develop technques for dealng/nteractng wth them. To ths end, the DataFoundry team at the Lawrence Lvermore Natonal Laboratory has developed a software prototype called Approxmate Adhoc Query Engne for Smulaton Data (AQSm). The goal of AQSm s to provde a framework that allows scentsts to nteractvely perform adhoc queres over terabyte scale datasets usng numercal models as proxes for the orgnal data. The advantages of ths system are several. The frst s that by storng only the model parameters, each dataset occupes a smaller footprnt compared to the orgnal, ncreasng the shelf lfe of such datasets before they are sent to archval storage. Second, the models are geared towards approxmate queryng as they are bult at dfferent resolutons, allowng the user to make the tradeoff between model accuracy and query response tme. Ths allows the user greater opportuntes for exploratory data analyss. Lastly, several dfferent models are allowed, each focusng on a dfferent characterstc of the data thereby enhancng the nterpretablty of the data compared to the orgnal. The focus of ths paper s on the modelng aspects of the AQSm framework. 1. Introducton Advances n computer technology and decreasng hardware costs have facltated the ease by whch large volumes of data are measured, generated, and stored. Whle ths s vewed as an nformaton bonanza, the result has often been a flood of data wth lmted means of makng effectve use of t. The central problem s that of sze. When datasets reach the ggabyte level and beyond, storng the nformaton as well as beng able to nteract wth t become ncreasngly constraned n space or tme or n some cases both. Whle research s beng conducted n developng mass storage devces, there wll always be lmted resources to accommodate the dataflow. A related ssue s that of data compresson as deally one would lke to store as mnmal nformaton as possble to maxmze the effectveness of the hardware. But even f the spatal constrants are resolved, the ssue of how to nteract wth the data wthn a reasonable perod of tme stll remans. Data analyss tasks such as askng an ad-hoc query become neffectve for large scale data sets f the tme to fnd and retreve the exact answer to the query posed s often longer than the user s wllng to tolerate. Ths s especally true f the scentst s n an exploratory data analyss mode and has several questons that they would lke to examne. To deal wth tme constrant problem, one approach that has been used recently s approxmate queryng systems, whch seek not to provde an exact soluton to the posed query but an approxmate one. The ratonal s that the user may prefer a fast, approxmate answer as they explore the dataset to develop a better understandng of truly nterestng queres and/or regons. To ths end, Acharya, et.al., 1999, Acharya, et.al., 1999a, Chakrabart s, et.al., 2000, Ioannds and Poosala (1999), Vtter and Wang (1999), Vtter and Wang (1998) have developed dfferent approaches for dealng wth aggregate and range queres n an approxmate fashon. However, the emphass n the above approaches focused on the ssue of a fast query response system and dd not specfcally address the ssue of storng such large data sets. Ths latter pont s mportant f the user should wsh to revst the data set whch, owng to ts sze, wll often have to be moved to tertary storage as s often the case wth scentfc smulatons. Accessng the data at such locatons s often prohbtve tme-wse and should be avoded f possble. As a result, what s needed s a system that not only allows approxmate ad hoc queres but also provdes a compact representaton of the data that s far easer to store and access compared to the orgnal. A survey of the lterature uncovers that Hachem, et.al. (1995) have mplemented a system where polynomal regresson models are used to model non-spatal data and the regresson coeffcents are stored n the database. In response to a query, the data are reconstructed from models. Such a system provdes the benefts of approxmate query response but at a far smaller storage cost. Followng a smlar approach ndependently, the DataFoundry team at Lawrence Lvermore Natonal Laboratory has developed the Approxmate Ad-hoc Query Engne for Smulaton Data (AQSm). The goal of AQSm s to provde a software framework that not only allows relatvely fast nteractve adhoc query capablty of terabyte scale data but also provdes a compact representaton of the data. Ths s acheved by constructng mathematcal models of the data at dfferent resolutons. In effect, these models act as data proxes and perform n ther place n response to a query submsson. Ths has several benefts. Frst, the dfferent resolutons of the data allow the user to trade off tme for accuracy. Second, along wth an ndex fle, only the model parameters are stored n a fle. As ths number s often smaller than the orgnal dataset sze, the shelf lfe of the dataset can be extended. The user can send the orgnal data to archval storage but can query usng the model parameters for data reconstructon. Fnally, as mathematcal models often characterze dfferent facets of the data, the model parameters themselves may provde addtonal nsghts not apparent when just dealng wth the orgnal raw values. Thus, n contrast to Hachem, et.al. s work where only one model s used to address the dfferent query needs of the scentsts, AQSm uses multple 1

4 Query processng query Preprocessng mesh fle Query Engne Metadata ndex to use Mesh converson reformatted data Index Searcher data value ndex fles Index Constructor Mesh Parttoner Predcate Processor/ Data Value Reconstructor model fles Modeler fltered data value Vsualzaton Fgure 1. Framework for AQSm. models. Whle AqSm s bult for dealng wth the spatotemporal nature of the scentfc smulatons for the Accelerated Strategc Computng Intatve program (ASCI), the framework s generc enough to be appled to other domans that deal wth terabyte scale datasets. In ths paper the focus s on the modelng aspect of the AQSm framework. Secton 2 provdes a bref overvew of the entre system. Secton 3 dscusses the modelng ssues that need to be addressed. Current results of AQSm are presented n Secton 4. Fnally, conclusons and future work are dscussed n Secton AQSm Framework Fgure 1 shows the AQSm framework under constructon. The system s dvded nto two phases: preprocessng and query processng. The goal of the frst phase s to take the orgnal data and buld the model fles and ndex structures necessary to support the query-processng step that follows t. As the focus wll be on developng a compact and accurate data proxy, more tme and computatonal effort s placed n ths ntal perod. The ratonal s to concentrate the computatonal burden when the data s close at hand and stll away from the user. In ths way, the tme spent durng query processng whch s where the user s expected to spend the bulk of ther tme can be mnmzed. The nput to the preprocessng wll be the orgnal data tself, the selecton of models to be mplemented, and a parttonng scheme. To acheve the approxmate query answer capablty, the data s statstcally characterzed and modeled at dfferent resolutons. These resolutons are created by parttonng the data nto progressvely smaller subunts and then subsequently modelng each of these parttons by the selected models. Ths procedure s repeated untl the parttonng reaches a selected threshold upon whch t s termnated. The current parttonng scheme s bsect the data along each of the spato-temporal coordnates (x,y,z,t), thereby creatng a 16-way tree. The ratonal was to take advantage of the spato-temporal nature of the scentfc smulatons that were beng studed. In any event, the end result s a tree structure. The root node represents the entre data set wth each of ts chldren (and subsequent chldren) representng a dfferent subset of the data. Each node contans the followng statstcs about the data that was n t: number of data ponts, number of models bult on the data n that node, the types of models used, model errors, and for each varable, ts correspondng mean, standard devaton, mnmum and maxmum values. As wll be dscussed later, the ndex searcher wll use ths nformaton when searchng for a response to a query. If drectly mplemented as stated above, one problem that s mmedately encountered s that s t unlkely that a model bult wth a majorty of the data s accurate or that t can be bult n a reasonably short tme. Hence, to balance the tme-accuracy tradeoff, the key s to fx the model complexty for each node. The supposton s that as the data becomes ncreasngly parttoned, the localzaton wll ncrease the lkelhood that a model of fxed complexty wll be able to model the behavor reasonably well. A comparable analogy s to break a nonlnear curve nto pece-wse components whch dependng on the length of the component can be suffcently lnear for lnear regresson models to be bult on the components. Naturally, there s no guarantee that the accuracy wll reach what s desred but ths combnaton of parttonng and modelng s beng used to a frst approxmaton. As wll be dscussed n Secton 5, dentfyng metrcs as to when to buld models as opposed to havng them bult at every node s a research topc. If such a metrc s found, t can decrease storage and computatonal costs of buldng models to only those that are needed. Once the parttonng and model buldng s done, the output from the preprocessng stage s an ndex fle contanng all the node nformaton along wth the correspondng model fles. It 2

5 s mportant to note the orgnal data s not stored n ether the ndex or model fles. The nodes only contan the nformaton lsted above along wth ponters to ts chldren. The model fles only contan nformaton about the model parameters that were used for each node. Hence, t s hoped that these subsequent fles wll be smaller n sze than the orgnal data. So, f a partcular secton of the data s to be reconstructed one needs to know whch node and whch model to use to reconstruct that partcular subsecton. The query-processng phase s the sde that deals prmarly wth the user and s the means by whch they present queres to the system. After the ndex and model fles have been ntalzed and read n by the system, the user can pose adhoc queres to the query engne. The query engne then dentfes the varables and predcates n the query and selects an ndex to use based on the metadata. An ndex searcher then examnes the ndex nodes usng the statstcal nformaton contaned wthn them to determne whch nodes (and models) are most relevant to the query. A lst of canddate nodes s then sent to a predcate processor/data value reconstructor, whch then reconstructs the data and formulates a response. Durng ths perod, the user can specfy the accuracy of the response and ths nformaton wll be used to determne how deep wthn the tree to search. Once the data has been reconstructed t s then sent to a vsualzaton tool to vew the results. 3. Modelng Issues As the models act as proxes for the data, t s essental that a set of crtera or metrcs be developed to assess the feasblty of usng a partcular model for the AQSm system. Intally, our focus was on tradtonal modelng crtera such as model error and complexty as these are often used to assess model performance. Unfortunately, due to the scale of the data nvolved and the tght couplng wth respect to queres, these crtera were found not to be partcularly nformatve. For example, model error as a measure of model accuracy was relevant but ts mportance was somewhat reduced to due to the multresolutonal structure of the tree whch reconstructs the data, n prncple, at a hgher accuracy (barng overfttng) as the leaves of the tree are approached. But as ths s nevertheless an mportant measure, mean square error or root mean square error was selected as a model error metrc to establsh a standard. Smlarly, model complexty was also vewed at frst as mportant but oftentmes metrcs such as Akake or Bayes nformaton crtera are often used to compare among a set of models, balancng model smplcty wth error, Myung (2000). The drawback was that the sze of the datasets precludes the buldng of several models or even teratng over the same model to obtan a better ft. As such, a new set of crtera has been developed focusng prmarly on the computatonal and storage cost of buldng models as well as examnng ther applcablty for dealng wth varous query types. The former has been smply termed as model buldng whereas model queryablty s used for the latter Model buldng Here, the goal s to assgn some sem-quanttatve measure to how computatonal expensve buldng a model can be as well as the assocated cost of storng the model parameters. Assume that y s the th varable of the orgnal data and x s a vector of ndependent varable(s) assocated wth y. x s denoted as a vector to emphasze generc-ness. X can be another varable n the data set (unvarate model), a collecton of other varables (multvarate model), or smply an ndex. Wth ths n mnd, the model s formulated as such: y = f ( x ; θ ) (1) where f denotes the model tself and θ s a vector of model parameters (coeffcents) The computatonal cost of buldng the model (equaton 1) per node s expressed n bg O-notaton (Cormen, et.al., 1998): O(F(n)) + r*o(g(n)) (2) where n represents the number of samples n a gven node, O(F(n)) represents the cost of buldng the ntal model where F(n) also ncludes any model specfc preprocessng steps, r s the number of steps needed to refne the model to ether a desred level of complexty or error ( r > 0), O(G(n)) s the cost of the model refnement steps where G(n) represents the functonal form of the cost as a functon of the number of samples. G(n) may be the same as F(n) or functonally dfferent. F(n), G(n) represent the generc functon forms. (eg. they can be n, nlogn, n 2, etc). The term r encapsulates the teratve nature of the refnement process dealng wth ssues such as ncreasng the number of knots n a B-splne or the processng steps for determnng the approprate coeffcent threshold values to acheve a target error. If the user desres a more accurate model, the prce s the addtonal computatonal tme requred to refne the model. Once the model has been bult, there s the ssue of the storage sze of the model. The storage cost per ndex tree node, s, can be estmated as s = p * v + (3) where s the sze of ndex mappng, v s the number of varables, and p s the parameter sze per varable, whch s the sze of θ (number of coeffcents, for example) multpled by the sze of the data type used (e.g., nteger, float, double). In general, the parameter sze cannot be readly estmated snce t s dependent on the data dstrbuton and the level of error specfed. However, f the model complexty s set, the sze can be estmated as the user explctly lmts the number of parameters per varable. Fnally, the ndex mappng may or may not exst dependng on the model type but s presented for completeness. It represents any addtonal bookkeepng that may be requred to reconstruct the data Model buldng trade-offs The above presents a formalsm to estmate the cost of model constructon n both memory and computaton. The followng general observatons can be made: As parttonng occurs, n becomes smaller so the computatonal cost goes down although functonally equaton 2 remans the same. If lower modelng error s desred, ths usually translates nto an ncrease n r of equaton 2 -- ncreasng the number of teratons to brng the model to convergence. θ s also lkely to ncrease (but not always) n sze as addtonal parameters are needed to brng the model to target error levels. Dependng on the modelng algorthm used, the number of teratons and the parameter sze may or may not be ted wth each other. In the case of regresson, addtonal model terms are ntroduced to reduce model error so the teraton and parameter sze are lnked. On the other hand, wth neural networks, the weghts of system do 3

6 not change n number but ther values are adjusted wth each teraton cycle. To summarze, to assess the full mpact of model buldng, the followng need to be known: Intal model cost O(F(n)) Model refnement cost O(G(n)) Model complexty (related to error as well) ths sets r and sze of θ Bookkeepng s t needed and f so how bg Once ths s known, equatons 2 and 3 can be used to estmate cost and storage requrements on a per node bass for a gven model. Ths nformaton then forms a bass by whch dfferent models can be compared on model constructon/memory storage crtera. To consder the effect of the entre data set, multply the above estmates (equatons 2 and 3) by the number of nodes n the ndex tree and adjust for n n the dfferent nodes Model queryablty As mentoned prevously t s not suffcent that the model smply reconstruct the data to wthn a small error as the nature of the ggabyte+ scale data make buld such models computatonally expensve. Ideally the most useful models are the ones that not only reconstruct data well but are amenable to dealng wth a varety of queres. It s ths ssue of queryablty that wll be dscussed n ths secton. In bref, queryablty consders the queston Short of reconstructng the data, how can the model structure of functonalty be used to answer the query? All the models n AQSm have the capablty of reconstructng the data to varyng degrees of accuracy. However, some models wll capture dfferent aspects of the data structure n ther formulaton than others. For example, wth the case of smple lnear regresson of y = mx + b the parameter m contans nformaton on the rate of change of y wth respect to x. Ths ease of nformaton access s what needs to be characterzed n order to effectvely determne whch model to use to answer a query. Ideally, the goal s to arrve at a mappng that would assocate partcular models to certan query types. To acheve ths, a set of crtera s needed to assess the feasblty of dfferent model types n the context of the queres that may be presented. As AQSm wll ncorporate multple models, understandng the applcablty or relevance a partcular model may have n the context of respondng to a query s essental. A survey of the lterature does not reveal any type of model/query assocaton metrcs and ponts to a need to develop one for ths system. In addton, consderng the varety and complexty of queres that may be asked, t s unlkely that a sngle quanttatve metrc can be arrved to measure queryablty. Instead, t may be more feasble to break the assessment nto sem-qualtatve and sem-quanttatve components. The former focuses on breakng down the query types nto dfferent categores such as the one lsted below. (Ths s a rough lst and not meant to be exhaustve or mutually exclusve) 1) range for a gven nput, fnd the assocated output 2) aggregate fnd how many data samples satsfy a gven query predcate 3) scale-dependent patterns/ dynamcty fnd smlar types of behavor at the specfed scale 4) topologcal fnd the samples that are wthn a specfed target range 5) varable nteractons fnd the data samples that satsfy the specfed varable relatonshps 6) area/volume effects fnd the area or volume for a partcular varable condton 7) doman-specfc user defnes a functon of nterest n the query predcate Each model canddate s graded on ther ablty to deal wth the dfferent categores, specfcally whether they are able to address such queres (apart from brute force data reconstructon) and how accurately. An addtonal consderaton may be to assgn a weght to the dfferent query types as to how lkely they are to occur and/or how mportant they are. Ideally, the models that are selected for the AQSm should be complementary wth respect to each other to mnmze redundant coverage. In ths regard, models that are doman-specfc probably should probably be always ncluded snce t s assumed that ther constructon contans domanrelevant nformaton, whch would not to be readly avalable from the orgnal data. But agan ths need may have to be balanced wth respect to how often would that knd of nformaton s requested. The sem-quanttatve aspect may be vewed as the number of data processng steps and/or computatonal tme requred to process the model to arrve at the desred output. For example, f the query s of a range type, specfcally Fnd all x for ths value of y and two models exst -- one model s that of a splne regresson (y = f(x)) and the other s a (x = f(), y = f()), estmatng how long would t take each model to arrve at an answer s an nterestng metrc. The former would requre solvng for the roots of the polynomal whereas the s may smply reconstruct all the data for gven range and then apply a flterng operaton to solate the relevant answers. It should be emphaszed that ths type of analyss s sem-quanttatve n nature snce t s hghly dependent on the dstrbuton of the data and ntal startng ponts Model-query nteracton Ths secton focuses on the cost of nteractng wth a model to get the desred answer to the query. The knowledge ganed here allows models to be compared on the bass of how computatonally expensve they are when dealng wth queres. Assume that the user poses the query n the form, Gven y, provde h(x ). Agan, y denotes a varable of the orgnal data set. H(x ) s presented here as the generc output needed to answer the query. In ts smplest form, h(x ) = x where x s smlar to the one used n equaton 1 wth the focus on ether a sngle orgnal varable or a collecton of varables (snce the ndex s not part of the orgnal data the user s not lkely to make any nqures about t). In a more advanced form, h(x ) may be a user-defned functon that uses as nput x. To answer the query (fnd h(x ) for y ), the procedure s to start wth what the model provdes and then process the results untl the desred nformaton becomes avalable. Ths cost of answerng a query essentally can be estmated by equaton 4 (Corman, et.al., 1998): m = 1 O( f ( n)) = O( m = 1 f ( n)) where m = number of processng steps requred, O(f (n)) s the computatonal cost assocated wth each processng step, n = number of samples. It s assumed that the processng costs are lnearly addtve. The key les n what s meant by a processng step. Consder that there exsts a model for a sngle varable, x = f() where s an ndex, and another model for a dfferent varable, y = g(j) where j s a dfferent ndex. For a Fnd the x assocated wth y, ths would requre creatng or learnng the (4) 4

7 mappng between and j, say = m(j), f the approprate x-y pars are to be uncovered. Ths mappng would be a processng step n addton to the one where for the gven y, the j s would have to be generated, j = g -1 (y), another processng step. The cost of answerng the queres requres both steps be processed. As another example, assume that x of equaton 1 s some multvarate collecton and agan the query s Fnd the x assocated wth y. Ths can be answered by solvng the nverse of equaton 1 and, for llustratve purposes, t s further assumed that ths s O(n 2 ) : 1 x f = ( y ) (5) Ths s not the only approach to solvng the problem. A naïve but smple alternatve would be to reconstruct the data for all avalable y and shft through to look for the x s of nterest. Assume for the sake of argument that ths s of O(nlogn) for the generaton and O(nlogn) for the shftng. Now, t s not clear as to whch approach, the solver or the naïve one, s better. The solver has one processng step but t s computatonally expensve. The naïve way nvolves two steps but ndvdually they are not as expensve. So, dependng on whch route, as well as the sample sze nvolved, the cost can vary. So, equaton 4 should be evaluated on both routes to obtan an estmate of the cost wth the goal of selectng the processng route that s cheaper. Gven two models, t s mportant to determne how much t would cost to use one model over another by breakng t down to the number of processng steps requred to generate the approprate response, and the cost assocated wth each step for each model. The fnal decson based on choosng the lesser total of the two. Note that ths analyss can be expanded to go further than just fndng x but can also consder the cost of generatng h(x ). The formulaton of equaton 4 s such that t can be used to evaluate model performance aganst a range of query types. Ths s done by the formulaton of h(x ). For range queres, t may be suffcent to have h(x ) = x. To examne how they fare on topologcal queres, choose the approprate h(x ) and then estmate the costs. Ths analyss, however, s not complete because recall the model also has θ I s. Whle the emphass has been on largely obtanng h(x ) from x, t s possble that the parameters themselves readly lend to h(x ) and ths should be consdered when evaluatng a model. For example, dynamcty s reflected n the coeffcents, whle nformaton such as rate of change can be observed n the coeffcents of lnear regresson models. Note that equaton 4 smply focuses on the processng steps but s not concerned whether they nvolve parameters or x s. The complementary crteron to the model buldng descrbed n secton 3.1 s to then select the model wth the lowest cost for a desred query type or types. In summary, for each model under consderaton, t s recommended to frst use equatons 2, 3, and 4 to assess the computatonal cost and memory needs as the model s used aganst dfferent query types (f applcable). Note, that these equatons have some adjustable parameters, such as m and r, and these wll depend on the accuracy sought and the query beng asked. In general, the model that has the lowest cost n computaton and memory for a gven query s a strong vable canddate. Ths analyss has not consdered tree search costs snce ths cannot be determned from the model structure tself but may prove to be substantal for some models (to obtan a target accuracy) compared to others. The fnal assessment should also consder ths search cost as well once the ndex searchng protocol has been fnalzed. In short, a model that answers the query of nterest wth mnmal total computatonal cost and memory s a strong canddate to be consdered for AQSm. 4. Results Below s an llustraton for a smple (fnte length) model and a statstcal model usng the mean as the model usng the crtera dscussed n secton 3. The results have been broken nto model buldng and model-query nteracton and, for the latter, wth some selected query types of nterest and the memory usage Model buldng Modelng buldng ncurs the followng computatonal cost. stat 2*O(n) + O(n ln n) N*O(n) Here, the cost for the s ncludes the cost of scannng through the data (a conservatve hgh estmate s presented) and the need to cull the coeffcents to obtan a target error. The scannng consders the cost of the downsamplng when performng the decomposton. Note that, as the model s nherently multresolutonal, t can n theory just work on the parent node and acheve a hgh level of accuracy. As to the statstcs model, the large N refers to the total number of nodes n the ndex tree. To acheve a desred level of accuracy, the data wll be parttoned to the resoluton that s desred. If the data s wdely dstrbuted or hgh precson s requested, N can approach or exceed n n value. The reason s that f each leaf node s beleved to contan a sngle pont then mnmum N = n, but as ths s only the leaves and not the rest of the tree, N>n Model-query nteracton The followng three query cases provde an overvew of the computatonal cost the two models ncur when quered. Range query: stat_model O(n) + O(n) O(ln N) reconstruct/flter fnd relevant nput In the case of the, the extreme case s consdered where the entre data s reconstructed and then fltered upon to fnd the values of nterest. To smplfy the analyss, only flterng was consdered. (It s possble to have the reconstruct selected areas of the data, but ths would be covered by the bookkeepng and ts functon form s not clearly known for the purposes of ths exercse.) It may be possble to shorten ths to O(d) where d << n f a mappng can be acheved by lnkng the dstrbuton of the data values to partcular coeffcents. For the statstcs model, the cost s largely n searchng a tree of N nodes to fnd the model values that satsfy the query. Dynamcty query: stat_model O( ln C) O(n) + O(G(n)) Snce the models stores nformaton about the dynamcty (varable actvty) n the coeffcents, ths would smply be a search over these model parameters, C number of coeffcents. For the case of statstcal model, t would requre a reconstructon of the data followed by further processng of the data to defne the dynamcty functon. If somethng asde from standard devaton s desred, ths can 5

8 be costly. For example, G(n) = n 2 for consderng par-wse nteractons. Topologcal query: stat_model O(n) + ΣO(G(n)) m*o(ln N) + ΣO(G(n)) The prmary cost of the les n reconstructng the data and then buldng the necessary topologcal functon, desgnated by the summaton term. For the statstcs model, the cost les n fndng all the relevant nodes m and searchng through a tree of N nodes. Once the data s obtaned there s the ssue of buldng the topologcal functon just as n the case. In general the model s advantageous because of the ease by whch the data can be reconstructed to a hgh level of accuracy. For the statstcs model, there s the cost of searchng through the ndex tree but ts accuracy s unlkely to match that of the system unless a large ndex tree s created n whch case N (nodes) can be greater than n (number of samples n orgnal data). The two models use the followng memory space per varable. stat C + n 2* N For the case of the s, C represents the number of coeffcents stored for a gven level of accuracy and the n s to ndcate the sze of the ndex bookkeepng where n s the number of samples n the orgnal data. In a worse case scenaro ths would represent the mappng between unque data values and the coeffcents. If there are redundant values, t may be possble to reduce n consderably. Furthermore, f a common ndex scheme s used, then only one bookkeepng system may be requred as opposed to one per varable. Ths would reduce storage costs consderably. As for the statstcs model, only two parameters are needed the mean and standard devaton, hence the number 2. N refers to the number of nodes n the ndex tree. Agan f hgh precson s requested, N can reach or exceed n as the mean s ablty to accurately reflect the data ncreases as the partton sze becomes ncreasngly localzed. In summary, whle the actual performance does vary dependng on the data dstrbuton, t appears that f hgh accuracy s desred N>>n, then a model would be useful for most of the applcatons otherwse a statstcs model wll do. 5. Conclusons and Future Work The current mplementaton of AQSm only has a statstcal mean as a model but t s beng upgraded to ncorporate a model to provde addtonal functonalty. Addtonal work wll also be done n the areas of ndex constructon for hgh multdmensonal data, parttonng schemes beyond the spato-temporal bsecton currently n place, mappng of model metadata to query types, and a metrc to buld models selectve. The sgnfcance of the last pont s many modelng technques are computatonally ntensve to buld, for example, O(n 3 ) for regresson based systems. As there s no guarantee that for gven data dstrbuton the resultng model wll produce a hghly accurate model, a metrc s needed to gude when a model has a reasonable chance of successful reconstructon. If the model performance s gong to be poor, t wll be better to use the mean as a crude model owng to computatonal effcency. Other research areas n the context of model evaluaton nclude the followng: how amenable the models are to beng parallelzed, how automated the model constructon can be, how robust the model s to dealng wth dfferent data types, and whether the model can be constructed onlne. References Acharya, S, Gbbons, PB, Poosala, V, and Ramaswamy, S (1999), Jon Synopses for Approxmate Query Answerng, n Proceedngs of the ACM SIGMOD Internatonal Conf. on Management of Data (SIGMOD '99), June 1999 Acharya S., Gbbons PB, Poosala, V and Ramaswamy, S. (1999a), The Aqua Approxmate Query Answerng System, demo n ACM SIGMOD Internatonal Conf. on Management of Data (SIGMOD '99), June Chakrabart, K, Garofalaks, MN, Rastog, R, and Shm, K (2000), Approxmate Query Processng Usng Wavelets, Proceedngs of the Internatonal Conference on Very Large Databases, Caro, Egypt. Cormen,TH, Leserson CE, and Rvest, RL (1998), Introducton to Algorthms, MIT Press Hachem, N, Bao C, and Taylor, S (1995), Approxmate Query Answerng n Numercal Databases, techncal report, Department of Computer Scence, Worcester Polytechnc Insttute Ioannds, YE, and Poosala, V, (1999) Hstogram-Based Approxmaton of Set-Valued Query Answers, n Proceedngs of the 25 th Internatonal Conference on Very Large Data Bases Myung, IJ (2000), The Importance of Complexty n Model Selecton, Journal of Mathematcal Psychology, vol 44, p Vtter, JS, and Wang, M (1999), Approxmate Computaton of Multdmensonal Aggregates of Sparse Data usng Wavelets, n Proceedngs of the ACM SIGMOD Internatonal Conf. on Management of Data (SIGMOD '99), June 1999 Vtter, JS, and Wang, M (1998), Data Cube Approxmaton and Hstograms va Wavelets, n Proceedngs of the 7th Internatonal Conference on Informaton and Knowledge Management of Data 6