Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management


Journal of Machine Learning Research 9 (2008) 2079-2111. Submitted 8/08; Published 10/08.

Abraham George (AGEORGE@PRINCETON.EDU)
Warren B. Powell (POWELL@PRINCETON.EDU)
Department of Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08544, USA

Sanjeev R. Kulkarni (KULKARNI@PRINCETON.EDU)
Department of Electrical Engineering
Princeton University
Princeton, NJ 08544, USA

Editor: Sridhar Mahadevan

© 2008 Abraham George, Warren B. Powell and Sanjeev Kulkarni.

Abstract

We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well.

Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources

1. Introduction

We consider the problem of managing resources (people, equipment) that can be described using a vector of attributes $a = (a_1, a_2, \ldots, a_M)$. Our work has grown out of a series of projects with industry and the military that involve managing resources over time under uncertainty. In all of these projects, we use algorithms that require estimating the marginal value of a resource with attribute vector $a$. As these projects have made the transition from laboratory experiments to industrial implementations, we have found that one and two dimensional attributes (for example, location and possibly equipment type) quickly grow to five or ten dimensions, with an exponential growth in the number of potential attributes. Examples of actual projects we have worked on which exhibit this behavior include:

- Managing pilots for business jets: The attributes of a pilot include elements such as home city, number of days away from home and the equipment that he is trained to fly. Decisions about pilots can include assigning a pilot to a particular flight, or a decision to send a pilot for training on a new type of aircraft.
- Managing locomotives: The decision to assign a particular locomotive to a particular train has to consider attributes such as the type of locomotive, the number of days until it has to be maintained, its current location and its home maintenance shop.
- Managing a fleet of freight cars: Freight cars have attributes such as location, time until arrival to destination, loaded or empty status, ownership, and maintenance status.
- Managing a fleet of trucks to move loads: The truck can be described using attributes such as its current location, the home domicile of the driver, the maintenance level and whether it is being driven by a solo driver or a team of two drivers. Decisions include where to move to and whether to move loaded or empty.
- Managing cargo aircraft for the military: We have to decide which aircraft should be assigned to satisfy a particular requirement (a movement of freight or passengers). Choosing the best aircraft requires knowing the value of an aircraft at the destination, which depends on the type of aircraft, cargo configuration, whether it is loaded or empty (and if loaded, the load characteristics), and its maintenance status.
- Managing blood inventories: Blood is characterized by blood type, age, location, and whether it has been frozen. New supplies of, and the demand for, blood are random.

All of these are examples of resource allocation problems where a decision has to be made now to act on resources (trucks, jets, locomotives) which will bring about a change in their attributes. Let $a \in \mathcal{A}$ be the attribute vector describing a resource now. If we act on the resource, we may produce a resource with attribute $a'$ with value $v_{a'}$. In a dynamic programming setting, the value $v_{a'}$ refers to the solution of a finite horizon discounted reward dynamic program.
In practical applications, we cannot compute $v_a$ exactly, so we resort to Monte Carlo methods where we might observe random observations $\hat{v}_a$ and use these to produce a statistical estimate $\bar{v}_a$ (see Bertsekas and Tsitsiklis 1996 and Sutton and Barto 1998 for an introduction to the techniques of approximate dynamic programming). The problem is that in realistic problems, the attribute space $\mathcal{A}$ can be extremely large, and we may obtain only a few observations of $\hat{v}_a$ for a particular $a$. As a result, the statistical error in $\bar{v}_a$ can be quite large. One of the standard strategies in approximate dynamic programming is to aggregate the state (attribute) space. Instead of estimating $v_a$, we might define an aggregation function $G(a)$ which produces an aggregated attribute which has fewer outcomes. For example, a five-digit zip code can be aggregated up to a three-digit zip; a numerical attribute can be divided into fewer ranges; or an attribute can be completely ignored. The resulting smaller attribute space produces more observations of each attribute, but at a cost of aggregation error.

There are a variety of statistical strategies for estimating value functions which take advantage of the structure of a specific attribute vector. In a trucking problem, we might design a statistical function that depends on the location of a driver, his days away from home, the fuel level of his tank and his home domicile. However, after designing a statistical model that works for this application, we would have to start from scratch if we wished to switch to another application. In fact, simply adding an attribute would require redesigning and refitting the statistical equation. This can be particularly hard when several of the attributes are categorical and interact to determine the effect of the attributes on the system. A truck driver might be characterized by his location and his home domicile; the value of a driver at a location depends very much on where he lives.

We are interested in developing a method for estimating the value $v_a$ of a resource with attribute $a$, making minimal assumptions about the structure of the attribute space. We take advantage of the fact that for every application with which we are familiar, it is quite easy to design a family of aggregation functions $\mathcal{G}$ where $G : \mathcal{A} \to \mathcal{A}^{(g)}$ is an aggregation of the attribute space $\mathcal{A}$. For example, we can create an aggregation function simply by ignoring an attribute. Aside from assuming the existence of this family of functions, we make no further assumptions about the nature of the attribute space. For example, we do not even require the existence of a metric that would provide a measure of the distance between two attribute vectors, which prevents the use of standard methods such as non-parametric statistics or regression trees.

Aggregation has traditionally been a powerful technique in dynamic programming. A good general review of aggregation techniques is given by Rogers et al. (1991). Aggregation strategies in a dynamic programming setting may be governed by the desire to solve exactly a smaller dynamic program, or by the iterative nature of the algorithms. Techniques range from picking a fixed level of aggregation (Whitt, 1978; Bean et al., 1987; Athans et al., 1995; Zhang and Sethi, 1998; Wang and Dietterich, 2000) to using adaptive techniques that change the level of aggregation as the sampling process progresses (Mendelssohn, 1982; Bertsekas and Tsitsiklis, 1996; Luus, 2000; Kim and Dean, 2003), but which still use a single level of aggregation at any given time (many authors used a fixed level of aggregation to produce a smaller Markov Decision Process (MDP) that can be solved optimally). Tsitsiklis and Van Roy (1996) (see also Bertsekas and Tsitsiklis, 1996) show how value functions can be approximated using a fixed set of features; this strategy encompasses both static and hierarchical aggregation as special cases, but the use of these techniques in our setting is prohibitive because of the extremely large number of values that need to be estimated. Feng et al. (2003) presents work that identifies state aggregations based on structural similarity, where states are considered similar if they have similar value estimates or similar sets of successor states, rather than input similarity, which is typically measured by some distance metric defined over the state space. Bertsekas and Castanon (1989) introduces a creative approach which adaptively clusters states with similar values of residual errors at each iteration, requiring no structure among the states of the system. While we also do not have any structure, we do take advantage of the presence of a family of aggregation functions, and our technique does not require the overhead of solving clustering problems. A nice discussion of aggregation and abstraction techniques in an approximate dynamic programming setting is given in Boutilier et al. (1999).

Instead of using a single level of aggregation, researchers have considered combining estimates from all the levels of aggregation at once. In the literature, there exist several techniques for combining estimates to improve accuracy (see Wolpert, 1992; LeBlanc and Tibshirani, 1996; Yang, 2001). It is well-known that if the estimates being combined are independent and unbiased, then it is optimal to combine them in inverse proportion to their variances (Guttman et al., 1965). When different estimates are based on different levels of aggregation, they are neither independent nor unbiased. It is also possible to use a weighted combination of estimates where the weights are estimated using regression techniques. For our application, there can be hundreds of thousands of such models, making the updating of regression models computationally expensive.
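The classical result for independent, unbiased estimates mentioned above can be sketched in a few lines. This is only the textbook inverse-variance rule, not the bias-adjusted formula for correlated estimates developed in this paper; the function name and the numbers in the usage lines are hypothetical.

```python
def inverse_variance_combine(estimates, variances):
    """Combine independent, unbiased estimates with weights proportional to 1/variance.

    Returns the combined estimate, the normalized weights, and the variance
    of the combined estimate (valid only under the independence assumption).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    precisions = [1.0 / var for var in variances]
    total_precision = sum(precisions)
    weights = [p / total_precision for p in precisions]
    combined = sum(w * e for w, e in zip(weights, estimates))
    return combined, weights, 1.0 / total_precision


# Hypothetical numbers: a noisy estimate (variance 3) and a precise one (variance 1).
combined, weights, combined_variance = inverse_variance_combine([4.0, 6.0], [1.0, 3.0])
print(combined, weights, combined_variance)
```

The more precise estimate receives the larger weight, and the combined variance is smaller than either input variance. Both properties depend on independence, which is exactly what fails when the estimates come from nested levels of aggregation.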

In this paper, we solve the problem of optimally combining (correlated) value estimates at different levels of aggregation in an approximate dynamic programming setting and derive expressions for optimal weights. The result generalizes a well-known result for optimally combining independent estimates. We point out that the independence assumptions used for deriving the results are true only in idealized regression settings, and not in an approximate dynamic programming setting.

The major contribution of this paper lies in finding that an inverse-variance weighting formula (adjusted for bias), which is optimal only when the estimates are independent, proves to be near-optimal even though estimates at different levels of aggregation are not independent. We explain this behavior analytically for the case with two levels of aggregation. We show that if we compute optimal weights (without assuming independence) and compare them with the weights obtained when we do assume independence, the results are the same at two extremes: when the difference between the aggregate and disaggregate value estimates is very large or very small. We show experimentally that the error for intermediate values is extremely small.

We also show, in the context of a single vehicle routing problem, that our weighting method produces value function estimates that are within five to ten percent of the optimal value functions, outperforming other estimates. The method of weighting a family of aggregate estimates is shown to naturally shift the weight from aggregate to disaggregate estimates as the algorithm progresses. We also demonstrate that this method is easy to implement in large-scale, on-line learning applications that arise in approximate dynamic programming, where it produces much faster convergence (that is, it approaches a consistently better solution quality in fewer iterations) than would be produced using a single, static level of aggregation. Further work on this application is explained in detail in Simao et al. (2008).

The paper is organized as follows.
In Section 2, we describe a generic approximate dynamic programming technique, which estimates the value functions associated with various states. This section provides an introduction to the context in which our statistical estimation problem arises. The next three sections, however, focus purely on the statistics of aggregation outside of a dynamic programming setting. Section 3 provides a theoretical model of the sampling process and defines bias and variance for aggregated statistics. Then, Section 4 poses the problem of computing optimal weights for combining estimates of values at different levels of aggregation. The problem with the resulting formula is that it is too expensive to use for our problem class. For this reason, we propose a simpler formula that assumes that statistics from different levels of aggregation are independent. In Section 5, we compare the two weighting formulas (with and without the independence assumption) for the special case where there are only two levels of aggregation, which allows the optimal weights to be computed analytically. We show theoretically that assuming independent estimates introduces zero expected error at two extremes of the problem. We then show experimentally that ignoring the dependence between the estimates gives results that are very similar. In Section 6, we demonstrate our approximation method in the context of an approximate dynamic programming algorithm for solving a multiattribute resource allocation problem. We use both a single truck problem, which can be solved exactly, and a problem of managing a large fleet of trucks. We provide our concluding remarks in Section 7.

2. Approximate Dynamic Programming

This section is designed as a brief introduction to approximate dynamic programming, and introduces the context in which our problem arises. Our interest lies in the context of dynamic resource allocation problems. Dynamic programming techniques can be applied to solve these problems, which are typically modeled as MDPs. Using the notation of Powell (2007), we let $S_t$ be the state of our system. We also let $d \in \mathcal{D}$ be a type of decision, and we let $x_d = 1$ if we choose decision $d$, and $x_d = 0$ otherwise. $x_t = (x_d)_{d \in \mathcal{D}}$ is the vector of decisions that we make at time $t$. Bellman's equation allows us to express the value of being in state $S_t$ as

$$V(S_t) = \max_{x_t} \left( C(S_t, x_t) + E\left\{ V\big(S_{t+1}(S_t, x_t, W_{t+1})\big) \mid S_t \right\} \right),$$

where $W_{t+1}$ is a random variable representing new information that arrives between $t$ and $t+1$. The exact values can be determined using traditional backward dynamic programming techniques such as value iteration and policy iteration. In these methods, the values are computed recursively starting from the final state, making use of the state transition probabilities.

When the state and action spaces become large, as in most real-life stochastic planning problems, it is not practical to enumerate the states to determine their values. In such problems, compact feature-based representations of the MDP, also called factored MDPs (see Boutilier et al., 2000), can be used to make the problem computationally tractable. Factored MDPs can be represented using a factored state transition model and a reward function that is additive. In these representations, a smaller set of variables (also called features or attributes) is used to describe the state of the system.

Dynamic resource allocation problems span dynamic vehicle routing (Gendreau and Potvin, 1998; Ichoua et al., 2005), where there has been recent interest in the application of approximate dynamic programming for the single vehicle routing problem (Secomandi, 2000, 2001). Powell and Carvalho (1998) uses an approximate dynamic programming algorithm for a fleet management problem, but the attributes of the vehicles were very simple. Powell et al. (2002) uses an approximate dynamic programming algorithm for multiattribute resources, but does not address statistical sampling issues. Spivey and Powell (2004) applies approximate dynamic programming for optimizing a fleet of vehicles, using a linear value function approximation that also requires estimating the value of a resource characterized by a vector of attributes. This research estimated the value of a resource at different levels of aggregation, but kept track of the variance of these estimates at each level of aggregation and always used the estimate that provided the smallest variance.

Resource allocation problems can be modeled by letting $a \in \mathcal{A}$ be an attribute vector ($a$ may consist of categorical and numerical attributes), and by letting $R_{ta}$ be the number of resources with attribute $a$. We then let $R_t = (R_{ta})_{a \in \mathcal{A}}$ be the resource state vector. This research addresses problems where the vector $a$ is large enough that the attribute space $\mathcal{A}$ becomes too large to enumerate. We develop these ideas in the context of a single entity. If $a_t$ is the attribute of the entity at time $t$, then $a_t$ is effectively our state variable.

In this section, we describe the basic approximate dynamic programming (ADP) strategy to solve the problem of managing a single resource with multiple attributes, the nomadic trucker. This is a single resource version of the dynamic fleet management problem, where there is a single trucker who needs to move between various locations to cover loads that arise and gains rewards in the process. The state of the resource is defined by an attribute vector, $a$, composed of multiple attributes, which may be numerical or categorical. For the nomadic trucker problem, examples of attributes include the location of the truck, the home domicile of the driver and the number of hours driven. We could represent the attribute vector as $a = (a_{\text{location}}, a_{\text{time}}, a_{\text{domicile}}, \ldots)$. The state space, $\mathcal{A}$, for this problem would consist of all possible combinations of the attributes of the trucker. We can let the decision be represented by the vector $(x_d)_{d \in \mathcal{D}}$, but for a single entity problem, $\sum_{d \in \mathcal{D}} x_d = 1$, which means we can also write the problem as choosing a decision $d \in \mathcal{D}$. Typically, the set of potential decisions depends on the current state (attributes) of our resource, so we let $\mathcal{D}_a$ be the decisions available to a resource with attribute $a$. We assume that the impact of a decision $d$ on a resource with attribute $a$ is deterministic, and is given by the function $a' = a^M(a, d)$.

In approximate dynamic programming, we sample the various states by choosing decisions that are locally optimal based on current estimates of the value functions. For example, we could follow a procedure where we choose a decision that maximizes the sum of the one-period rewards and the future value (discounted by a factor $\gamma$) as follows:

$$d(a, \omega) = \arg\max_{d \in \mathcal{D}_a(\omega)} \left\{ c(a, d, \omega) + \gamma \bar{v}_{a^M(a,d)} \right\}.$$

Here, $\omega$ represents a sample realization of random information (for example, $\mathcal{D}_a(\omega)$ is a sample realization of the decision set), $a^M(a, d)$ is the state at the destination and $\bar{v}_{a^M(a,d)}$ the value associated with $a^M(a, d)$. This model is easily generalized to handle stochastic transitions, but this is not relevant to the focus of this paper.

We outline the steps of a typical approximate dynamic programming algorithm for the nomadic trucker problem in Figure 1. This algorithm has two stages. In the forward pass, we use the current estimates of the optimal value functions to simulate a sample trajectory of the truck. The next state that is visited is determined using a transition function $a^M(a_m, d_m)$, as depicted in Equation 2, where the resource in state $a_m$ undergoes a transformation to state $a_{m+1} = a^M(a_m, d_m)$ when acted upon by decision $d_m$. Once the end of the time horizon is reached, we perform a backward pass, where we first compute the observations of values of the various states in the current sample path using Equation 3. We point out that the estimates of the future values are discounted by a factor $\gamma$.
We then use these to update the value estimates, as in Equation 4, and the associated statistics (number of observations and sample variance) of the states that are visited.

There are a number of variations of approximate dynamic programming. One family is known as TD($\lambda$)-learning (see Sutton, 1988; Sutton and Barto, 1998), typically parameterized by an artificial discount factor $\lambda$. Using a pure forward pass algorithm is equivalent to TD(0), while another variation follows a policy (determined by the current set of approximations), and then does a backward traversal to obtain updates of the estimate of the value of being in each state (this is equivalent to TD(1)). Another popular strategy is Q-learning (see Watkins, 1989), where we estimate the quantities $Q(a, d)$, the value of being in state $a$ and making decision $d$. Since the statistical problem of estimating the value of a state-action pair is, of course, even harder than the problem of estimating the value of being in a state, we have not used this approach. Since Q-learning allows you to determine a decision directly from the Q-factors (rather than solving an optimization problem), it is typically presented as a model-free algorithm (that is, one that does not require an explicit model of the transition function), although estimating the Q-factors does require some source that determines the next state given a state and action. All of these methods can be used without an explicit model of the exogenous information process (for example, we do not use a one-step transition function) as long as we have some mechanism for creating the sample realizations.

As with most ADP algorithms, the only way to obtain an estimate of the value of being in a state is to actually visit the state. In real applications, there may be millions of states but we may be limited to only thousands of observations. In practice, most states are never visited, and many are visited only a few times. As a result, there can be a high level of statistical noise in our estimates of the value of being in a state.

Step 0. Initialize an approximation for the value function $\bar{v}^0_a$ for all attribute vector states $a \in \mathcal{A}$ and set $n = 1$.
Step 1. Iteration $n$:
Step 2. Forward pass: Set $m = 0$ and randomly sample attribute vector $a_m$, but fixing the start time at the beginning of the time horizon.
Step 3. Obtain the set of possible decisions, $\mathcal{D}_{a_m}(\omega)$.
Step 4. Solve for the optimal decision, given the current value function estimates:
$$d_m(\omega) = \arg\max_{d \in \mathcal{D}_{a_m}(\omega)} \left\{ c(a_m, d) + \gamma \bar{v}^{n-1}_{a^M(a_m, d)} \right\} \qquad (1)$$
Step 5. Evaluate the next state to visit:
$$a_{m+1} = a^M(a_m, d_m) \qquad (2)$$
Step 6. If the end of the time horizon ($T$) has not been reached, then set $m = m + 1$ and go to step 3, else go to step 7.
Step 7. Backward pass: For $j = m-1, m-2, \ldots, 0$, update the value function estimates as follows:
$$\hat{v}^n_{a_j} = c(a_j, d_j) + \gamma \hat{v}^n_{a_{j+1}} \qquad (3)$$
$$\bar{v}^n_{a_j} = (1 - \alpha) \bar{v}^{n-1}_{a_j} + \alpha \hat{v}^n_{a_j} \qquad (4)$$
Step 8. Let $n = n + 1$. If $n < N$ go to step 1, else for each state $a$, return the value function $\bar{v}^n_a$.

Figure 1: An approximate dynamic programming algorithm using a backward pass for the nomadic trucker problem.

This section provides the context in which our adaptive learning problem arises. The next three sections consider the general problem of estimating a quantity (the value of a resource with attribute $a$) outside of the context of approximate dynamic programming. We assume we have a source of (unbiased) observations of the value associated with attribute $a$, from which we have to develop statistically robust (i.e., low-variance) estimates of the value associated with attribute $a$. We then use the method in the context of approximate dynamic programming to demonstrate that it produces better results than other methods, even though we no longer have unbiased observations.
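The forward and backward passes of Figure 1 can be sketched on a toy problem. The sketch below makes several simplifying assumptions that are not in the paper: the decision set is deterministic rather than a sample realization, the value beyond the horizon is taken to be zero, the step size `alpha` is constant, and the three-location problem with a unit reward for moving is purely hypothetical.

```python
import random


def adp_backward_pass(states, decisions, transition, reward,
                      horizon, n_iterations, alpha=0.1, gamma=0.9, seed=0):
    """Toy version of the Figure 1 algorithm with a deterministic decision set."""
    rng = random.Random(seed)
    v_bar = {a: 0.0 for a in states}              # Step 0: initial value estimates
    for _ in range(n_iterations):                 # Steps 1 and 8: outer iterations
        a = rng.choice(states)                    # Step 2: sample a starting attribute
        path = []
        for _ in range(horizon):                  # Steps 3-6: forward pass
            # Step 4: greedy decision under the current estimates (Equation 1)
            d = max(decisions(a),
                    key=lambda d: reward(a, d) + gamma * v_bar[transition(a, d)])
            path.append((a, d))
            a = transition(a, d)                  # Step 5: next state (Equation 2)
        v_hat = 0.0                               # assumed value beyond the horizon
        for a_j, d_j in reversed(path):           # Step 7: backward pass
            v_hat = reward(a_j, d_j) + gamma * v_hat               # Equation 3
            v_bar[a_j] = (1 - alpha) * v_bar[a_j] + alpha * v_hat  # Equation 4
    return v_bar


LOCATIONS = ["NY", "NJ", "PA"]
v_bar = adp_backward_pass(
    states=LOCATIONS,
    decisions=lambda a: LOCATIONS,                # the truck may move to any location
    transition=lambda a, d: d,                    # the decision is the destination
    reward=lambda a, d: 1.0 if d != a else 0.0,   # hypothetical reward for moving
    horizon=5,
    n_iterations=200,
)
print(v_bar)
```

Only states that appear on a sampled path are ever updated, which is the source of the statistical noise discussed above: with a large attribute space, most entries of `v_bar` would keep their initial value.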

3. The Statistics of Aggregation

In this section, we investigate the statistics of aggregation by studying a sampling process where at iteration $n$ we first sample the attribute vector $a = \hat{a}^n$. We then use a sample realization of the random information which provides us with an unbiased observation of the value of the resource $\hat{v}^n$, producing a sequence of observations of (attribute vector, value) pairs. We wish to use this information to produce a statistically reliable estimate of the true value associated with $a$. The analysis in this section is not done in the context of dynamic programming (which allows us to assume that our observations of values are unbiased). Rather, it is intended as a pure study of the statistics of aggregation. Our assumption that the observations of values, $\hat{v}^n$, are unbiased will not be true in a dynamic programming setting, but allows us to focus on the tradeoff between bias and variance. We begin by defining the following:

$\mathcal{N}$ = The set of indices corresponding to the observations of the attribute vectors and values.
$\mathcal{S}$ = A sample of observations $(\hat{a}^n, \hat{v}^n)_{n \in \mathcal{N}}$.
$\nu_a$ = The true value associated with attribute vector $a$.
$N_a$ = The number of observations of attribute vector $a$ given our sample $\mathcal{S}$.
$\hat{a}^n$ = The attribute vector at observation $n$.
$\hat{v}^n$ = The observation of the value corresponding to index $n$.
$1_{\{\hat{a}^n = a\}}$ = 1, if the $n$th observation is of attribute vector $a$ (and 0 otherwise).

An estimate of $\nu_a$ can be obtained as an average across all the observations of values corresponding to $a$:

$$\bar{v}_a = \frac{1}{N_a} \sum_{n \in \mathcal{N}} \hat{v}^n \, 1_{\{\hat{a}^n = a\}}.$$

Throughout our presentation, we use the hat notation (as in $\hat{v}$) to represent exogenous information, and bars (as in $\bar{v}$) to represent statistics derived from exogenous information.

Consider a case where the attribute vector has more than one dimension, with $|\mathcal{A}_i|$ denoting the number of distinct states that attribute $a_i$ can assume. The number of values that need to be estimated is $\prod_i |\mathcal{A}_i|$. Needless to say, as the attribute vector grows, the state space grows exponentially, making it impossible to obtain statistically reliable estimates.
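To make the growth concrete, the following sketch computes the disaggregate sample average defined above and the number of values implied by a product of attribute sizes. The attribute sizes, the observation budget and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import prod


def disaggregate_estimates(observations):
    """Average the observed values separately for each attribute vector."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a_hat, v_hat in observations:
        totals[a_hat] += v_hat
        counts[a_hat] += 1
    return {a: totals[a] / counts[a] for a in counts}, dict(counts)


# Three hypothetical observations of (attribute vector, value) pairs.
sample = [(("NY", "Single"), 7.0), (("NY", "Single"), 2.0), (("NY", "Team"), 7.0)]
v_bar, n_obs = disaggregate_estimates(sample)

# Hypothetical attribute sizes: location, days away, domicile, equipment type.
attribute_sizes = [50, 10, 50, 4]
n_values = prod(attribute_sizes)          # number of attribute vectors to estimate
observations_per_value = 5000 / n_values  # with a budget of 5000 observations
print(v_bar, n_values, observations_per_value)
```

With these made-up sizes there are 100,000 attribute vectors, so a budget of a few thousand observations leaves most of them with no data at all.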
One strtegy is to resort to ggregtion (such s dropping one or more dimensions of ) which cn quickly reduce the number of vlues but introduces structurl error. An lterntive is to ssume structurl property such s seprbility, which reduces the number of vlues to be estimted to i A i. This hs fewer vlues, but requires tht we introduce seprbility s n pproximtion. In one of our trucking pplictions, one ttribute is the loction of the truck, while second ttribute is the driver s home domicile. The vlue of driver in loction depends very much on his home domicile. Assuming these re independent would introduce significnt errors. In generl, ggregtion of ttribute vectors is performed using collection of ggregtion functions, G g : A A, where A represents the g th level of ggregtion of the ttribute spce A. We define the following: = G g (), the g th level ggregtion of the ttribute vector. 2086

$\mathcal{G}$ = The set of indices corresponding to the levels of aggregation.

Aggregation can thus be used to create a sequence of state spaces, $\{\mathcal{A}^{(g)}, g = 1, 2, \ldots, |\mathcal{G}|\}$, with fewer elements than the original state space. This can be better illustrated using the example in Figure 2, where we consider the nomadic trucker problem with the state of the truck defined by two attributes: current location and capacity type. The number of possible states with three locations (NY, NJ and PA) and three capacity types (C1, C2 and C3) is nine at the most disaggregate level. The first-level aggregation function, $G^{(1)}$, involves aggregating the location to the regional level, which reduces the number of states to three. The second-level aggregation function, $G^{(2)}$, would be defined as aggregating out the capacity type attribute completely, which leaves us with a single state. As in this example and in the experimental work to follow, it is usually the case that the $g$th level of aggregation acts on the $(g-1)$st level.

Figure 2: Aggregation of the state space for a multiattribute problem.

We let $\varepsilon^n$ denote the error in the $n$th observation with respect to the true value associated with $\hat{a}^n$ (which, using the notation defined earlier in this section, would be represented using $\nu_{\hat{a}^n}$). For analysis purposes, we assume that the elements of the sequence $\{\varepsilon^n\}_{n \in \mathcal{N}}$ are independent and identically distributed, with a mean value of zero. This is, of course, an idealization, but it will help us understand the tradeoffs between structural errors (due to aggregation) and statistical errors. We can express the observed value as follows:

$$\hat{v}^n = \nu_{\hat{a}^n} + \varepsilon^n.$$

We define the following probability spaces:

$\Omega_a$ = The set of outcomes of observations of attribute vectors.
$\Omega_\varepsilon$ = The set of outcomes of observations of the errors in the values.
$\Omega$ = The overall set of outcomes = $\Omega_a \times \Omega_\varepsilon$.
$\omega$ = $(\omega_a, \omega_\varepsilon)$ = An element of the outcome space.
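The aggregation example with three locations and three capacity types can be sketched in a few lines of Python. This is only an illustration: the function names `g0`, `g1`, `g2` and the region label "Northeast" are assumptions introduced here, not names from the paper.

```python
from itertools import product

LOCATIONS = ["NY", "NJ", "PA"]
CAPACITY_TYPES = ["C1", "C2", "C3"]
# Hypothetical region label: all three locations fall in one region.
REGION_OF = {"NY": "Northeast", "NJ": "Northeast", "PA": "Northeast"}


def g0(a):
    """Level 0: the disaggregate attribute vector (location, capacity type)."""
    return a


def g1(a):
    """Level 1: aggregate the location to its region, keep the capacity type."""
    location, capacity = a
    return (REGION_OF[location], capacity)


def g2(a):
    """Level 2: acts on level 1 by aggregating out the capacity type."""
    region, _capacity = g1(a)
    return (region,)


states = list(product(LOCATIONS, CAPACITY_TYPES))
# Number of distinct aggregated states at each level.
sizes = [len({g(a) for a in states}) for g in (g0, g1, g2)]
print(sizes)  # [9, 3, 1]
```

Writing `g2` in terms of `g1` mirrors the remark that the $g$th level of aggregation usually acts on the $(g-1)$st level.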

We now define the following terms, which will be useful in obtaining an estimate of the value associated with the attribute vector $a$ at any level of aggregation:

$\mathcal{N}_a^{(g)}$ = The set of indices that correspond to observations of the attribute vector $a$ at the $g$th level of aggregation = $\{n \mid G^g(\hat{a}^n) = G^g(a)\}$.
$N_a^{(g)}$ = $|\mathcal{N}_a^{(g)}|$.
$\bar{v}_a^{(g)}$ = The estimate of the value associated with the attribute vector $a$ at the $g$th level of aggregation, given the sample, $\mathcal{N}$.

We can compute the estimate, $\bar{v}_a^{(g)}$, as

$$\bar{v}_a^{(g)} = \frac{1}{N_a^{(g)}} \sum_{n \in \mathcal{N}_a^{(g)}} \hat{v}^n.$$

We provide a numerical example to illustrate the idea of forming estimates at different levels of aggregation. Consider the state of a resource to be composed of two attributes, namely, location of the resource and resource type. There are four locations, namely, New York, Philadelphia, Boston and Washington. The type can be Single or Team. Thus, there are eight possible states. We use aggregation functions that aggregate out the type attribute and then the location attribute to obtain three different levels of aggregation. Suppose we have the following observations of state-value pairs: $\{(a_3,4), (a_4,2), (a_1,7), (a_5,8), (a_3,2), (a_8,5), (a_7,1), (a_5,9), (a_8,6), (a_2,7), (a_3,5), (a_1,2)\}$. We can form estimates of the values of the various states at the different levels of aggregation as illustrated in Table 1.

a    Location       Type     N^(0)   v^(0)   N^(1)   v^(1)   N^(2)   v^(2)
a_1  New York       Single   2       4.5     3       5.3     12      4.8
a_2  New York       Team     1       7.0     3       5.3     12      4.8
a_3  Philadelphia   Single   3       3.7     4       3.3     12      4.8
a_4  Philadelphia   Team     1       2.0     4       3.3     12      4.8
a_5  Boston         Single   2       8.5     2       8.5     12      4.8
a_6  Boston         Team     0       -       2       8.5     12      4.8
a_7  Washington     Single   1       1.0     3       4.0     12      4.8
a_8  Washington     Team     2       5.5     3       4.0     12      4.8

Table 1: Numerical example illustrating the computation of value estimates using aggregation. For example, $\bar{v}_{a_1}^{(0)} = (7 + 2)/2 = 4.5$, and $\bar{v}_{a_7}^{(1)} = (5 + 1 + 6)/3 = 4.0$.

Now that we have an estimate of the value, $\nu_a$, for each level of aggregation, the question arises as to what is the best level of aggregation. A traditional strategy is to choose the right level of aggregation by trading off statistical and structural errors to find a model with the least overall error.
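The estimates in the numerical example can be reproduced with a short script. The sketch below uses the twelve observations listed in the text; the data layout and the names `STATES`, `AGGREGATORS` and `estimates` are illustrative choices made here, not part of the paper.

```python
from collections import defaultdict

# State index -> (location, type), following the order used in the example.
STATES = {
    1: ("New York", "Single"), 2: ("New York", "Team"),
    3: ("Philadelphia", "Single"), 4: ("Philadelphia", "Team"),
    5: ("Boston", "Single"), 6: ("Boston", "Team"),
    7: ("Washington", "Single"), 8: ("Washington", "Team"),
}
# (state index, observed value) pairs from the text.
OBSERVATIONS = [(3, 4), (4, 2), (1, 7), (5, 8), (3, 2), (8, 5),
                (7, 1), (5, 9), (8, 6), (2, 7), (3, 5), (1, 2)]

AGGREGATORS = [
    lambda loc, typ: (loc, typ),  # g = 0: disaggregate
    lambda loc, typ: (loc,),      # g = 1: type aggregated out
    lambda loc, typ: (),          # g = 2: location aggregated out as well
]


def estimates(observations):
    """Return one dict per level: aggregated attribute -> (count, mean)."""
    levels = []
    for aggregate in AGGREGATORS:
        values = defaultdict(list)
        for index, value in observations:
            values[aggregate(*STATES[index])].append(value)
        levels.append({key: (len(v), sum(v) / len(v))
                       for key, v in values.items()})
    return levels


levels = estimates(OBSERVATIONS)
print(levels[0][("New York", "Single")])  # (2, 4.5)
print(levels[1][("Washington",)])         # (3, 4.0)
print(round(levels[2][()][1], 1))         # 4.8
```

States that were never observed, such as (Boston, Team) at level 0, simply do not appear in the corresponding dictionary, which matches the empty entry in the table.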

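In order to better understand these two kinds of errors in an aggregation setting, we first let $\delta_a^{(g)}$ denote the total error in the estimate, $\bar{v}_a^{(g)}$, from the true value associated with attribute vector $a$:

$$\delta_a^{(g)} = \bar{v}_a^{(g)} - \nu_a.$$

An important component of our prediction error will be aggregation bias. Consider our most recently observed attribute vector $\hat{a}^n$ and some other attribute $a$, where $\hat{a}^n$ and $a$ may aggregate up to the same aggregated attribute at some level $g \in \mathcal{G}$, that is, $G^g(a) = G^g(\hat{a}^n)$ (for the moment, these are simply two attribute vectors). In our derivations below, it is useful to define a bias term,

$$\mu_a^n = \nu_{\hat{a}^n} - \nu_a.$$

We can use this notation to rewrite $\hat{v}^n$ as follows:

$$\hat{v}^n = \nu_a + (\nu_{\hat{a}^n} - \nu_a) + \varepsilon^n = \nu_a + \mu_a^n + \varepsilon^n.$$

We can express $\bar{v}_a^{(g)}$ in terms of its bias and noise components as follows:

$$\bar{v}_a^{(g)} = \frac{1}{N_a^{(g)}} \sum_{n \in \mathcal{N}_a^{(g)}} (\nu_a + \mu_a^n + \varepsilon^n) = \nu_a + \frac{1}{N_a^{(g)}} \sum_{n \in \mathcal{N}_a^{(g)}} \mu_a^n + \frac{1}{N_a^{(g)}} \sum_{n \in \mathcal{N}_a^{(g)}} \varepsilon^n.$$

We let

$$\bar{\mu}_a^{(g)} = \frac{1}{N_a^{(g)}} \sum_{n \in \mathcal{N}_a^{(g)}} \mu_a^n, \qquad \bar{\varepsilon}_a^{(g)} = \frac{1}{N_a^{(g)}} \sum_{n \in \mathcal{N}_a^{(g)}} \varepsilon^n.$$

This enables us to express the total error as follows:

$$\delta_a^{(g)} = \bar{\mu}_a^{(g)} + \bar{\varepsilon}_a^{(g)} \qquad (5)$$

where $\bar{\mu}_a^{(g)}$ gives an estimate of the bias between the values of $a$ at the $g$th level of aggregation and at the disaggregate level. $\bar{\mu}_a^{(g)}$ is a random variable that is a function of the set of points sampled. $\bar{\varepsilon}_a^{(g)}$ is an estimate of the random error that has zero expected value. By assumption, the variability in $\bar{\varepsilon}_a^{(g)}$ occurs because of the statistical noise in the observation of the values. We point out that the terms $\delta_a^{(g)}$, $\bar{\mu}_a^{(g)}$ and $\bar{\varepsilon}_a^{(g)}$ are not statistical estimators, because knowledge of the true values is required for computing these. $\bar{\mu}_a^{(g)}$ is representative of the structural error that is introduced due to aggregation, while $\bar{\varepsilon}_a^{(g)}$ represents the statistical error due to noise in the observations. Moreover, these two error terms need not be uncorrelated in a general setting.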
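The split of the total error into a bias part and a noise part can be checked numerically on synthetic data. In the sketch below the true values, the noise level and the sample size are all hypothetical; the point is only that the identity holds for any sample, because it is an algebraic rearrangement of the average.

```python
import random

random.seed(0)

# Hypothetical true values for three attributes that share one aggregate cell.
TRUE_VALUE = {"a1": 4.0, "a2": 6.0, "a3": 9.0}
target = "a1"  # the attribute whose value we want to estimate

# Each sample: (observed attribute, observed value, noise realization).
samples = []
for _ in range(50):
    attribute = random.choice(list(TRUE_VALUE))
    noise = random.gauss(0.0, 1.0)
    samples.append((attribute, TRUE_VALUE[attribute] + noise, noise))

n = len(samples)
v_bar = sum(value for _, value, _ in samples) / n
# Average bias of the sampled attributes relative to the target attribute.
mu_bar = sum(TRUE_VALUE[attr] - TRUE_VALUE[target] for attr, _, _ in samples) / n
# Average noise in the sample.
eps_bar = sum(noise for _, _, noise in samples) / n

delta = v_bar - TRUE_VALUE[target]
print(abs(delta - (mu_bar + eps_bar)) < 1e-9)
```

As the text notes, `mu_bar` and `eps_bar` are not estimators: the script can compute them only because it has access to the true values and the noise realizations.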

In an approximate dynamic programming setting, the right tradeoff between statistical and structural errors will change as we collect more observations. Furthermore, we generally do not control the sampling process of the attributes, and we will encounter instances where some regions of the attribute space $\mathcal{A}$ will be sampled more than others. Although it is common in practice to choose a single level of aggregation that produces the lowest overall error, it can be useful to combine estimates from several levels of aggregation.

4. Combining Estimates

In this section, we propose methods to compute weights to combine value estimates that have been formed from a given set of observations. In the context of ADP, the weights are computed at a given iteration of the algorithm in Figure 1. We consider a set of estimates, $\{\bar{v}_a^{(g)}, g \in \mathcal{G}\}$, of a value, $\nu_a$, at different levels of aggregation. We let $\sigma_a^{(g)}$ denote the population standard deviation associated with the observations used to compute $\bar{v}_a^{(g)}$.

Breiman (1996) proposes a method called stacked regression, which in our setting would be equivalent to combining estimates at different levels of aggregation using

$$\bar{v}_a = \sum_{g \in \mathcal{G}} w^{(g)} \bar{v}_a^{(g)},$$

where $w^{(g)}$ is a set of weights for each level of aggregation. This method ignores the important feature that the best weighting depends on how many times we have observed a particular attribute. We prefer to use the strategy suggested by LeBlanc and Tibshirani (1996) (Section 8), where the weights depend on the attribute:

$$\bar{v}_a = \sum_{g \in \mathcal{G}} w_a^{(g)} \bar{v}_a^{(g)}.$$

The practical challenge here is that we have to estimate a set of weights $(w_a^{(g)})$ for each attribute (that we observe). If we use classical regression methods for our applications, this can mean maintaining hundreds of thousands of regression models. Storing and updating these models is computationally demanding for large industrial applications. In this section, we develop both exact and approximate methods for estimating weights, where our approximation makes the assumption that the estimates $\bar{v}_a^{(g)}$ are independent.
Section 5 presents theoretical and experimental arguments supporting the accuracy of the weights when we assume independence (even when the assumption is not even approximately true), which dramatically simplifies the procedure.

In Section 4.1, we formulate the problem of finding the optimal weights for the general case where the estimates may be biased and dependent on each other, and later derive these weights. However, the computation of these weights can prove cumbersome for large-scale problems. We provide, in Section 4.2, a simpler formula for the weights which assumes that the estimates are independent, but accounts for the possibility of biases in the estimates. In Section 4.3, we propose an approximation of the weights derived in Section 4.2, for the case where the bias and variance of the estimators are unknown.

4.1 Optimal Weights

We begin by finding the weighting scheme that will optimally combine the estimates at the different levels of aggregation, that is, the weights which give a combined estimate with the least squared deviation from the true value associated with attribute vector $a$. We can formulate the problem as follows:

$$\min_{w_a^{(g)},\, g \in \mathcal{G}} \; E\left[ \frac{1}{2} \left( \sum_{g \in \mathcal{G}} w_a^{(g)} \bar{v}_a^{(g)} - \nu_a \right)^2 \right] \qquad (6)$$

subject to:

$$\sum_{g \in \mathcal{G}} w_a^{(g)} = 1. \qquad (7)$$

In a setting where the estimates are unbiased, it is useful to have an affine combination of the estimates (LeBlanc and Tibshirani, 1996, Section 2) since the individual estimates and hence the affine combination are equal to the true value in expectation. Even though this is not necessarily true in a general setting, we choose to retain this constraint. We state the following proposition for computing the optimal weights that solve the problem formulated in Equations 6-7:

Proposition 1 For a given attribute vector, $a$, the optimal weights, $w_a^{(g)}$, $g \in \mathcal{G}$, to combine individual estimates that are correlated in a hierarchical fashion, are obtained by solving the following system of linear equations in $(w, \lambda)$:

$$\sum_{g \in \mathcal{G}} w_a^{(g)} E\left[ \delta_a^{(g)} \delta_a^{(g')} \right] - \lambda = 0 \quad \forall g' \in \mathcal{G}, \qquad (8)$$

$$\sum_{g \in \mathcal{G}} w_a^{(g)} = 1. \qquad (9)$$

If the bias error, $\bar{\mu}_a^{(g)}$, is uncorrelated with the random error, $\bar{\varepsilon}_a^{(g)}$, then the coefficients of the weights in Equation 8 can be expressed as follows:

$$E\left[ \delta_a^{(g)} \delta_a^{(g')} \right] = E\left[ \bar{\mu}_a^{(g)} \bar{\mu}_a^{(g')} \right] + \frac{\sigma_\varepsilon^2}{N_a^{(g')}}, \quad g \le g' \text{ and } g, g' \in \mathcal{G}, \qquad (10)$$

where $\sigma_\varepsilon^2$ denotes the variance of the statistical noise in the observations.

Proof: The proof is given in Appendix A. The derivation of Equation 8 involves using the Lagrangian for the problem stated in Equations 6-7 and performing some simple arithmetic on the corresponding first-order optimality conditions. Equation 9 is identical to Equation 7 from the optimization formulation. In the remainder of this analysis, our computations will be conditional on a given sequence of observed attribute vectors. In other words, all expectations and probabilities are computed with respect to the probability space, $\Omega_\varepsilon$. We prove Equation 10 by simplifying the expression $E[\delta_a^{(g)} \delta_a^{(g')}]$ using some properties of hierarchical aggregation.
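Once the matrix of expected error cross-products is available, the linear system in the proposition is small and can be solved directly. The following sketch is one way to do that in plain Python; the helper names `solve` and `optimal_weights`, and the numbers in the example matrix, are assumptions made for illustration and are not taken from the paper.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def optimal_weights(cross_moments):
    """Weights solving sum_g w_g * M[g][g'] - lam = 0 for every g', sum_g w_g = 1.

    cross_moments[g][g'] holds the expected product of the errors at
    levels g and g' (a symmetric matrix supplied by the caller).
    """
    size = len(cross_moments)
    # Unknowns are (w_0, ..., w_{size-1}, lam).
    system = [[cross_moments[g][gp] for g in range(size)] + [-1.0]
              for gp in range(size)]
    system.append([1.0] * size + [0.0])
    rhs = [0.0] * size + [1.0]
    return solve(system, rhs)[:size]


# Hypothetical two-level example: noise variance 4, N0 = 2, N1 = 10,
# expected squared aggregate bias 1, and an unbiased disaggregate level.
sigma2, n0, n1, mu2 = 4.0, 2, 10, 1.0
M = [[sigma2 / n0, sigma2 / n1],
     [sigma2 / n1, mu2 + sigma2 / n1]]
w = optimal_weights(M)
print([round(x, 4) for x in w])  # [0.3846, 0.6154]
```

For a handful of aggregation levels this is cheap for one attribute; the cost noted in the text comes from having to maintain and solve such a system for each of a very large number of attributes.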

For the case where $g = 0$, we can use the result, $E[\bar{\mu}_a^{(0)} \bar{\mu}_a^{(g')}] = 0$ (which follows from the property $\bar{\mu}_a^{(0)} = 0$), to further simplify (10) and obtain the following result:

$$E\left[ \delta_a^{(0)} \delta_a^{(g')} \right] = \frac{\sigma_\varepsilon^2}{N_a^{(g')}}. \qquad (11)$$

We refer to the optimal weighting scheme as WOPT.

4.2 An Approximation Assuming Independence

It is a well-known result in statistics that if the estimates $\{\bar{v}_a^{(g)}, g \in \mathcal{G}\}$ were independent and unbiased, then the optimal weights would be given by

$$w_a^{(g)} = \frac{\left( (\sigma_a^{(g)})^2 / N_a^{(g)} \right)^{-1}}{\sum_{g' \in \mathcal{G}} \left( (\sigma_a^{(g')})^2 / N_a^{(g')} \right)^{-1}}. \qquad (12)$$

We can obtain this result from Proposition 1 as follows. If we assume that the estimates $\{\bar{v}_a^{(g)}, g \in \mathcal{G}\}$ are independent and unbiased, then the cross-terms in Equation 8 disappear, leaving behind the following modified relation:

$$w_a^{(g)} E\left[ (\delta_a^{(g)})^2 \right] - \lambda = 0 \quad \forall g \in \mathcal{G}. \qquad (13)$$

Solving Equations 13 and 9 gives us weights that are inversely proportional to the expected squared errors, $E[(\delta_a^{(g)})^2]$. For the case of independent, unbiased estimates, $E[(\delta_a^{(g)})^2]$ is identical to the variance, $(\sigma_a^{(g)})^2 / N_a^{(g)}$.

Solving the system of equations in Proposition 1 can be computationally expensive since, in practice, there may be hundreds of thousands of models. For practical solutions, it will be useful to have an expression along the lines of Equation 12 for computing the weights, even though neither of the conditions (independence and absence of bias) holds true for estimates that arise from aggregation, due to structural errors introduced in the process of aggregation. In order to adapt the simpler formula in (13) to the aggregation setting while acknowledging the bias, we first define:

$\mu_a^{(g)}$ = Expected bias in the estimate, $\bar{v}_a^{(g)}$ = $E[\bar{v}_a^{(g)}] - \nu_a$.

For biased estimates, the total squared error can be expressed as the sum of bias and variance components, provided the bias and variance are independent of each other (Hastie et al., 2001, p. 24):

$$E\left[ (\delta_a^{(g)})^2 \right] = \frac{(\sigma_a^{(g)})^2}{N_a^{(g)}} + (\mu_a^{(g)})^2. \qquad (14)$$
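The step from the per-level first-order condition to weights that are inversely proportional to the expected squared errors can be spelled out as follows. This is a sketch in the notation of this section; the shorthand $m^{(g)}$ for the expected squared error at level $g$ is introduced here only to keep the lines short.

```latex
\begin{align*}
  w_a^{(g)}\, m^{(g)} - \lambda = 0
    \;&\Longrightarrow\; w_a^{(g)} = \frac{\lambda}{m^{(g)}},
    \qquad m^{(g)} := E\big[(\delta_a^{(g)})^2\big], \\
  \sum_{g \in \mathcal{G}} w_a^{(g)} = 1
    \;&\Longrightarrow\; \lambda = \Big( \sum_{g' \in \mathcal{G}} \frac{1}{m^{(g')}} \Big)^{-1}, \\
  \text{hence}\quad
  w_a^{(g)} &= \frac{1 / m^{(g)}}{\sum_{g' \in \mathcal{G}} 1 / m^{(g')}} .
\end{align*}
```

With $m^{(g)}$ equal to the variance of the estimate, this is the classical inverse-variance weighting. Substituting the bias-plus-variance expression for $m^{(g)}$ instead gives weights of the same form with a squared-bias term added to each denominator.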

We use this relation to modify the weights as follows:

$$w_a^{(g)} = \frac{\left( \dfrac{(\sigma_a^{(g)})^2}{N_a^{(g)}} + (\mu_a^{(g)})^2 \right)^{-1}}{\sum_{g' \in \mathcal{G}} \left( \dfrac{(\sigma_a^{(g')})^2}{N_a^{(g')}} + (\mu_a^{(g')})^2 \right)^{-1}} \quad \forall g \in \mathcal{G}. \qquad (15)$$

We call this weighting scheme WIND.

4.3 Weighting by Inverse Mean Squared Errors

In the more realistic setting where the exact values of the parameters involved in the computation of weights as in Equation 15 are unknown, we propose using the plug-in principle (see, for example, Efron and Tibshirani 1993, chapter 4), where we use statistical estimates of the bias and variance to produce approximations of the weights. We first compute estimates of the bias and the variance using

$(s_a^{(g)})^2$ = The sample variance of the observations corresponding to the estimate $\bar{v}_a^{(g)}$

$$= \frac{1}{N_a^{(g)} - 1} \sum_{n \in \mathcal{N}_a^{(g)}} \left( \hat{v}^n - \bar{v}_a^{(g)} \right)^2.$$

$\bar{\mu}_a^{(g)}$ = An estimate of the bias in the estimated value ($\bar{v}_a^{(g)}$) from the true value

$$= \bar{v}_a^{(g)} - \bar{v}_a^{(0)}.$$

The approximate weights on the estimates at different levels of aggregation are inversely proportional to the estimates of their mean squared deviations (obtained as the sum of the variances and the squared biases) from the true value:

$$w_a^{(g)} = \frac{\left( \dfrac{(s_a^{(g)})^2}{N_a^{(g)}} + (\bar{\mu}_a^{(g)})^2 \right)^{-1}}{\sum_{g' \in \mathcal{G}} \left( \dfrac{(s_a^{(g')})^2}{N_a^{(g')}} + (\bar{\mu}_a^{(g')})^2 \right)^{-1}} \quad \forall g \in \mathcal{G}. \qquad (16)$$

We refer to this formula as weighting by inverse mean squared errors (WIMSE). In the event that $N_a^{(g)}$ is too small or zero (which can happen in the early iterations and/or at the more disaggregate levels), it is difficult to form meaningful estimates of the variance and bias. In such a situation, we set the corresponding weight to zero.

Equation 16 is very easy to calculate even for large-scale applications where we may observe hundreds of thousands of attributes. However, it produces the best results only when the estimates of values at different levels of aggregation are independent, an assumption that we cannot expect to hold true. In the next section, we present theoretical and experimental evidence supporting the claim that the error introduced from this assumption is negligible.
It is important to note that the use of the plug-in principle, which in this setting means using statistical estimates of parameters (the bias and variance), may result in some unexpected behavior when the number of observations is small. For example, the estimate of the total squared error in Equation 14 would be expected to decrease with each additional observation. When we use estimates of the bias and variance, this is no longer guaranteed, especially when $N_a^{(g)}$ is small. However, our empirical evidence is that it seems to behave as expected in an aggregate sense.
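The plug-in weighting scheme described above can be sketched for a single attribute as follows. Several details here are assumptions of this sketch rather than choices stated in the paper: a level needs at least two observations to receive a nonzero weight, the sample variance uses the usual divisor of one less than the count, a small floor guards against division by zero, and the function name `wimse_weights` is hypothetical.

```python
from statistics import mean, variance


def wimse_weights(values_by_level, min_count=2, floor=1e-12):
    """Plug-in inverse mean-squared-error weights for one attribute.

    values_by_level[g] lists the observed values that fall in the attribute's
    cell at aggregation level g, with g = 0 the most disaggregate level.
    Level 0 must contain at least one observation, because the bias of every
    level is estimated relative to the level-0 mean.
    """
    levels = [[float(x) for x in values] for values in values_by_level]
    v0 = mean(levels[0])
    inverse_mse = []
    for values in levels:
        count = len(values)
        if count < min_count:
            inverse_mse.append(0.0)  # too few observations: weight set to zero
            continue
        s2 = variance(values)        # sample variance of the observations
        bias = mean(values) - v0     # estimated bias relative to level 0
        inverse_mse.append(1.0 / max(s2 / count + bias ** 2, floor))
    total = sum(inverse_mse)
    if total == 0.0:
        return [0.0] * len(levels)
    return [x / total for x in inverse_mse]


# Observations for (New York, Single) from the numerical example in Section 3:
# its own cell, the New York cell, and the fully aggregated cell.
weights = wimse_weights([
    [7, 2],
    [7, 2, 7],
    [4, 2, 7, 8, 2, 5, 1, 9, 6, 7, 5, 2],
])
print([round(w, 3) for w in weights])
```

With only two noisy observations at the disaggregate level, most of the weight goes to the aggregate levels, where the estimated variance of the mean is much smaller and the estimated bias is modest.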

5. The Case for Assuming Independence

In this section, we justify our decision to ignore the dependence between the estimates from hierarchical aggregation, while combining them to form an improved estimate. We discuss the special case where we combine estimates from only two levels of aggregation, which enables us to obtain simple expressions for computing the various parameters. We assume that the statistical noise is independent of the attribute vector sampled and also that we know the probability distributions of the sampling of the attribute vectors and their values. These assumptions enable us to solve the optimality equations to obtain a solution explicitly.

In Section 5.1, we analytically compare the two sets of equations (with and without assuming independence) for computing optimal weights. We provide an experimental comparison of the two methods, demonstrating the similarity in results, in Section 5.2.

5.1 Analytical Comparison

For the two-level problem, we can obtain the optimal weights (WOPT) by solving the following system of equations:

$$E\left[ (\delta_a^{(0)})^2 \right] w_a^{(0)} + E\left[ \delta_a^{(0)} \delta_a^{(1)} \right] w_a^{(1)} - \lambda = 0,$$
$$E\left[ \delta_a^{(0)} \delta_a^{(1)} \right] w_a^{(0)} + E\left[ (\delta_a^{(1)})^2 \right] w_a^{(1)} - \lambda = 0,$$
$$w_a^{(0)} + w_a^{(1)} = 1,$$
$$w_a^{(0)}, w_a^{(1)} \ge 0.$$

Since we are concerned with computing the weights for a particular attribute vector $a$, we drop the index $a$ in the following analysis. We obtain the value of $w^{(0)}$ as

$$w^{(0)} = \frac{E\left[ (\delta^{(1)})^2 \right] - E\left[ \delta^{(0)} \delta^{(1)} \right]}{E\left[ (\delta^{(0)})^2 \right] + E\left[ (\delta^{(1)})^2 \right] - 2 E\left[ \delta^{(0)} \delta^{(1)} \right]}. \qquad (17)$$

By assumption, the estimate at the disaggregate level is unbiased, that is, $\bar{\mu}^{(0)} = 0$. We let $\mu^2 = E[(\bar{\mu}^{(1)})^2]$ denote the expected value of the square of the bias term at the aggregate level. Using Equations 10 and 11, we may write

$$E\left[ \delta^{(0)} \delta^{(1)} \right] = \frac{\sigma_\varepsilon^2}{N^{(1)}},$$
$$E\left[ (\delta^{(0)})^2 \right] = \frac{\sigma_\varepsilon^2}{N^{(0)}},$$
$$E\left[ (\delta^{(1)})^2 \right] = E\left[ (\bar{\mu}^{(1)})^2 \right] + E\left[ (\bar{\varepsilon}^{(1)})^2 \right] = \mu^2 + \frac{\sigma_\varepsilon^2}{N^{(1)}}.$$

These results enable us to rewrite Equation 17 for computing the weights on the disaggregate estimate using the WOPT scheme (which we denote as $w_{opt}$) as follows:

$$w_{opt} = \frac{1}{1 + \left( \dfrac{1}{N^{(0)}} - \dfrac{1}{N^{(1)}} \right) \dfrac{\sigma_\varepsilon^2}{\mu^2}}. \qquad (18)$$

The competing scheme, WIND, assumes independence of the estimates. The weights at the disaggregate level are obtained using the formula:

$$w_{ind} = \frac{1 + \dfrac{1}{N^{(1)}} \dfrac{\sigma_\varepsilon^2}{\mu^2}}{1 + \left( \dfrac{1}{N^{(0)}} + \dfrac{1}{N^{(1)}} \right) \dfrac{\sigma_\varepsilon^2}{\mu^2}}. \qquad (19)$$

We denote by $\tilde{v}_{opt}$ and $\tilde{v}_{ind}$ the estimates computed using the two weighting schemes:

$$\tilde{v}_{opt} = w_{opt} \bar{v}^{(0)} + (1 - w_{opt}) \bar{v}^{(1)},$$
$$\tilde{v}_{ind} = w_{ind} \bar{v}^{(0)} + (1 - w_{ind}) \bar{v}^{(1)}.$$

We can write the difference between the estimates of the value obtained with and without the independence assumption as

$$\Delta = \tilde{v}_{opt} - \tilde{v}_{ind} = \Delta w \, \Delta v,$$

where $\Delta w = w_{opt} - w_{ind}$ and $\Delta v = \bar{v}^{(0)} - \bar{v}^{(1)}$. The following proposition establishes that $\Delta$ is small under certain conditions.

Proposition 2
(i) $\lim_{\mu \to 0} E[\Delta] = 0$,
(ii) $\lim_{\mu \to \infty} \Delta = 0$,
(iii) $\lim_{\sigma^2 \to 0} \Delta = 0$.

Proof: (i) As $\mu \to 0$, $w_{opt} = 0$ and $w_{ind} = N^{(0)} / (N^{(0)} + N^{(1)})$. $w_{ind}$ attains a maximum value of $1/2$ when $N^{(0)} = N^{(1)}$, but that would imply that $\bar{v}^{(0)} = \bar{v}^{(1)} \Rightarrow \Delta v = 0$. At the other extreme, if $N^{(0)} = 0$, then $w_{ind} = 0 \Rightarrow \Delta w = 0$. For intermediate values of $N^{(0)}$, it is no longer true that the random variable $\Delta v$ will always be zero (for statistical reasons), but we can show that its expectation will be zero using

$$E[\Delta] = E\{ E[\Delta \mid N] \},$$
$$E[\Delta \mid N] = E\left[ -\frac{N^{(0)}}{N^{(0)} + N^{(1)}} \Delta v \,\Big|\, N \right] = -\frac{N^{(0)}}{N^{(0)} + N^{(1)}} E[\Delta v \mid N] = 0.$$

Since $\mu^2 = 0$, $E[\Delta v] = 0$ and Equation 20 follows.

(ii) As $\mu \to \infty$, $w_{ind} \to w_{opt}$, which can be easily obtained by applying the appropriate limits in Equations 18 and 19. This is intuitive since with very high bias, the best strategy is to put all the weight on the most disaggregate level. As a result, $\Delta w \to 0$.

(iii) As the variance goes to zero, $w_{ind} \to w_{opt}$, which again implies $\Delta w \to 0$.

Thus, the error from the independence assumption is small when the bias is high or low, or when the variance is low. The error will be highest for moderate values of the bias and higher values of the variance.
Given that the errors vanish for the extreme cases, it is perhaps not surprising that the errors are never very large. We provide experimental evidence to support this conclusion in the next section.
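The limiting behavior of the two disaggregate-level weights in the two-level problem can be checked numerically. The sketch below writes both closed-form weights with numerator and denominator multiplied through by the squared bias, which is algebraically equivalent and avoids dividing by zero when the bias vanishes; the function names and the parameter values are hypothetical.

```python
def w_opt(n0, n1, sigma2, mu2):
    """Optimal (WOPT) weight on the disaggregate estimate, two-level case."""
    return mu2 / (mu2 + sigma2 * (1.0 / n0 - 1.0 / n1))


def w_ind(n0, n1, sigma2, mu2):
    """Weight on the disaggregate estimate when independence is assumed (WIND)."""
    return (mu2 + sigma2 / n1) / (mu2 + sigma2 * (1.0 / n0 + 1.0 / n1))


# Hypothetical setting: 2 disaggregate and 10 aggregate observations,
# noise variance 4, and a range of squared-bias values.
n0, n1, sigma2 = 2, 10, 4.0
for mu2 in (1e-9, 0.1, 1.0, 10.0, 1e9):
    gap = w_opt(n0, n1, sigma2, mu2) - w_ind(n0, n1, sigma2, mu2)
    print(f"mu2={mu2:g}  w_opt={w_opt(n0, n1, sigma2, mu2):.4f}  "
          f"w_ind={w_ind(n0, n1, sigma2, mu2):.4f}  gap={gap:.4f}")
```

In this setting the weights nearly coincide for very large bias. For very small bias they differ (zero against roughly $N^{(0)}/(N^{(0)}+N^{(1)})$), but in that regime the two estimates being combined have the same expectation, so the expected difference between the combined estimates still vanishes.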

Figure 3: A piecewise constant function with its aggregate approximation. Estimates of values of each attribute vector are computed at both the aggregate and disaggregate levels. A weighted averaging is done to improve the estimates.

5.2 Experimental Results

In this section, we analyze the estimation of functions characterized by known parameters (which effectively requires that we know the actual function) in order to demonstrate the effectiveness of the optimal weighting strategy, as well as to serve as a benchmark for the strategy which assumes independent estimates at different levels of aggregation. We observe that the weights given by either method (Equations 18 and 19) are functions of the bias in the value at the aggregate level, the variance of the statistical noise in the observation of the values and the number of observations at either level. In order to compare the values of the weights from the competing strategies, we create scenarios with different combinations of the parameters that would produce significant changes in the weights. We then analyze how the variations in the weights given by WOPT and WIND affect the actual function estimates computed using the two schemes.

We consider a piecewise constant monotone function and its aggregate version as shown in Figure 3. We note that there are distinct regions in the domain where the bias is high, intermediate and zero; we expect the relative weights to be very different in these three regions. Figure 4 gives the weights (to be applied to the disaggregate level) produced by the optimal formula, WOPT, and the formula assuming independence, WIND, for each attribute. The weights are obtained by sampling since the corresponding Equations (18 and 19) require the number of observations at the two levels of aggregation, $N^{(0)}$ and $N^{(1)}$. As we would expect, the optimal weights at the disaggregate level are zero when there is no structural error, in contrast to WIND.
When the structural error is highest, the weights produced by the two methods are very similar. Note, however (consistent with our understanding from the previous section), that the weights are also quite different for the cells $a = 2$ and $a = 5$, where the aggregate and disaggregate functions are most similar (which means the bias is small). It is also the case that the weight to be given to the disaggregate level is smallest when the bias is smallest.

Figure 4: Comparison of the weights over the function domain. (Plot of the average weights produced by WOPT and WIND for each attribute, grouped by aggregate cell.)

We have illustrated the difference in the weights produced by the two strategies, but less obvious is the difference in the estimates of the underlying function. In order to compare the two schemes, we developed a measure of the degree to which a weighting strategy reduced the variance of an estimate. We define the following:

$\tilde{v}_a^s$ = The value of the attribute vector $a$ as estimated by strategy $s$.
$\varepsilon^s$ = The sum of squared errors as estimated by strategy $s$ = $\sum_{a \in \mathcal{A}} (\tilde{v}_a^s - \nu_a)^2$.
$\varepsilon^G$ = The sum of squared errors using the static aggregation strategy, which treats the function as a constant over its domain.
$\theta^s$ = The performance measure for strategy $s$ = $1 - \varepsilon^s / \varepsilon^G$.

$\theta^s$ measures the degree of variability explained by a particular weighting strategy relative to using a single constant, which can be thought of as a default strategy where all observations are aggregated together. $\theta^s$ is analogous to an $R^2$ measure commonly used in statistics.

A major factor in the performance of a weighting strategy is the relative size of the structural variation compared to the statistical noise. For this purpose, we define an index, $\rho$, that measures the ratio of the noise to the bias. Figure 5 compares the performances of the two weighting strategies for three levels of noise. We observe that the performances of WOPT and WIND are almost identical even though there were situations where the weights given by the two schemes were significantly different. The similarity in the function estimates from the two strategies is explained by the analysis in Section 5.1.

Figure 5: Comparison of the performance, as measured by $\theta^s$, of WOPT and WIND in estimating the piecewise constant function. (Performance measure $\theta^s$ against the number of observations, for $\rho = 1$, $\rho = 2$ and $\rho = 5$.)

We tested the relative performance of the two methods for other function classes. We summarize the results in Figure 6, where we plot the performance measure as a function of the average number of observations per disaggregate cell. We observe that there is very little statistical difference between the performance of the two methods.

Figure 6: Comparison of WOPT and WIND for various function types (linear, concave, sinusoidal, random, concave non-monotone) using expected values of weights. The graph shows the average performance measure ($\theta^s$) over 1000 samples for a moderate value of $\rho = 2$. WOPT is represented using solid lines and WIND with dashed lines; the two are virtually indistinguishable.

From this analysis, we conclude that WIND, which combines estimates assuming independence, will generally be a close approximation of WOPT. Of particular interest for our problem setting is that WIND is much easier to implement.
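A performance measure of this kind can be computed in a few lines. The sketch below reads the measure as one minus the ratio of a strategy's sum of squared errors to that of a constant fit, an interpretation consistent with the $R^2$ analogy in the text. Using the mean of the true values as the constant is an assumption of this sketch, and the function values are hypothetical.

```python
def sum_squared_errors(estimates, true_values):
    """Sum over attributes of the squared gap between estimate and true value."""
    return sum((estimates[a] - true_values[a]) ** 2 for a in true_values)


def performance(estimates, true_values):
    """One minus SSE(strategy) / SSE(constant fit).

    Assumption of this sketch: the constant is the mean of the true values.
    """
    constant = sum(true_values.values()) / len(true_values)
    baseline = {a: constant for a in true_values}
    return 1.0 - (sum_squared_errors(estimates, true_values)
                  / sum_squared_errors(baseline, true_values))


# Hypothetical piecewise constant function on ten cells and a noisy estimate.
true_values = {a: v for a, v in enumerate([1, 1, 1, 4, 4, 4, 4, 9, 9, 9], start=1)}
noisy = {a: v + (0.5 if a % 2 else -0.5) for a, v in true_values.items()}

print(performance(true_values, true_values))  # 1.0
print(round(performance(noisy, true_values), 3))
```

A value of 1 corresponds to a perfect fit, 0 to doing no better than the constant, and negative values to doing worse than the constant.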

6. Experiments in an ADP Application

We implemented the hierarchical weighting strategy in the approximate dynamic programming procedure for solving the nomadic trucker problem described in Section 2. In Section 6.1, we describe the specifics of the problem instances that we consider. We also state the competing strategies that we compare in the experiments that follow. We then proceed to show the effectiveness of our hierarchical weighting scheme using two sets of experiments. In Section 6.2, we report on experiments where the discount factor is set to zero. In this case, the observations of values are unbiased, since they do not involve the estimates of values of future states. In Section 6.3, we present the results of experiments with positive discount factors. We have made available a collection of data sets used in these experiments on the following webpage: http://castlelab.princeton.edu/. Finally, in Section 6.4, we provide experimental results from applying our techniques on an industrial-strength problem.

6.1 Experimental Design

We consider a problem where we specify the state of the truck using three attributes, namely, the current location, the day of week and the number of days away from home. The problem is rich enough to offer interesting opportunities for hierarchical aggregation, but small enough that we can solve the problem to obtain the exact solution. The decisions are to be made over a finite time horizon of 21 time periods.

The location attribute can be represented at two degrees of resolution: regions (eastern Pennsylvania, northern New Jersey) or geographical areas (Northeast, Midwest and so on). There are 50 locations at the region level which can be aggregated to 10 geographical areas.

The major contributor to the stochastic nature of the nomadic trucker problem is the uncertainty in the availability of loads in any particular location to be moved to other locations. The probability that a load is available to be moved from one location to another is dependent on the origin-destination pair.
Another factor that influences the load availability is the day of week. Loads are more likely to appear during the beginning of the week (Mondays) and towards the end (Fridays). We use a probability distribution whereby the load availability dips during the middle of the week and is lower over the weekends. We introduce further uncertainty into the problem by allowing the one-period contributions to be moderately noisy.

The final attribute that we consider is the number of days that the driver is away from home. There is a penalty that we impose on moves that keep the driver away from his home domicile, which is a quadratic function of the number of days away from home. In order to keep the state space manageable (so we can obtain optimal solutions), we cap the number of days away from home at 12.

In Table 2, we list the aggregations that we use for the problem and the number of attribute states at each level of aggregation. For example, at aggregation level 1, the location attribute is aggregated from 50 regions to 10 geographical areas. We aggregate out the day-of-week attribute and retain the days-away-from-home attribute. Adding in the factor for 21 time periods, we have a total of 2541 possible states. The apparent discrepancy in the size of the state space at levels 0 and 1 arises because the days-away-from-home attribute is always set to 0 for the location corresponding to the home of the driver, while for all the other locations it can be any number from 1 to 12.

In order to compute the true values associated with each attribute vector, we use a standard backward dynamic programming algorithm. Our focus is on the problem of statistical estimation of the true values of the various states. In order to form estimates of these values we incorporate