Machine Learning with Operational Costs



Journal of Machine Learning Research 14 (2013) 1989-2028. Submitted 12/11; Revised 8/12; Published 7/13.

Machine Learning with Operational Costs

Theja Tulabandhula (THEJA@MIT.EDU)
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

Cynthia Rudin (RUDIN@MIT.EDU)
MIT Sloan School of Management and Operations Research Center
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

Editor: John Shawe-Taylor

Abstract

This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm's objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization.

Keywords: statistical learning theory, optimization, covering numbers, decision theory

1. Introduction

Machine learning algorithms are used to produce predictions, and these predictions are often used to make a policy or plan of action afterwards, where there is a cost to implement the policy. In this work, we would like to understand how the uncertainty in predictive modeling can translate into uncertainty in the cost for implementing the policy. This would help us answer questions like:

Q1. What is a reasonable amount to allocate for this task so we can react best to whatever nature brings?

Q2. Can we produce a reasonable probabilistic model, supported by data, where we might expect to pay a specific amount?

Q3. Can our intuition about how much it will cost to solve a problem help us produce a better probabilistic model?

The three questions above cannot be answered by standard decision theory, where the goal is to produce a single policy that minimizes expected cost. These questions also cannot be answered by

robust optimization, where the goal is to produce a single policy that is robust to the uncertainty in nature. Those paradigms produce a single policy decision that takes uncertainty into account, and the chosen policy might not be a best response policy to any realistic situation. In contrast, our goal is to understand the uncertainty and how to react to it, using policies that would be best responses to individual situations.

There are many applications in which this method can be used. For example, in scheduling staff for a medical clinic, predictions based on a statistical model of the number of patients might be used to understand the possible policies and costs for staffing. In traffic flow problems, predictions based on a model of the forecasted traffic might be useful for determining load balancing policies on the network and their associated costs. In online advertising, predictions based on models for the payoff and ad-click rate might be used to understand policies for when the ad should be displayed and the associated revenue.

In order to propagate the uncertainty in modeling to the uncertainty in costs, we introduce what we call the simultaneous process, where we explore the range of predictive models and corresponding policy decisions at the same time. The simultaneous process was named to contrast with a more traditional sequential process, where first, data are input into a statistical algorithm to produce a predictive model, which makes recommendations for the future, and second, the user develops a plan of action and projected cost for implementing the policy. The sequential process is commonly used in practice, even though there may actually be a whole class of models that could be relevant for the policy decision problem. The sequential process essentially assumes that the probabilistic model is correct enough to make a decision that is close enough.

In the simultaneous process, the machine learning algorithm contains a regularization term encoding the policy and its associated cost, with an adjustable regularization parameter. If there is some uncertainty about how much it will cost to solve the problem, the regularization parameter can be swept through an interval to find a range of possible costs, from optimistic to pessimistic. The method then produces the most likely scenario for each value of the cost. This way, by looking at the full range of the regularization parameter, we sweep out costs for all of the reasonable probabilistic models. This range can be used to determine how much might be reasonably allocated to solve the problem.

Having the full range of costs for reasonable models can directly answer the question in the first paragraph regarding allocation, "What is a reasonable amount to allocate for this task so we can react best to whatever nature brings?" One might choose to allocate the maximum cost for the set of reasonable predictive models, for instance. The second question above is "Can we produce a reasonable probabilistic model, supported by data, where we might expect to pay a specific amount?" This is an important question, since business managers often like to know if there is some scenario/decision pair that is supported by the data, but for which the operational cost is low (or high); the simultaneous process would be able to find such scenarios directly. To do this, we would look at the setting of the regularization parameter that resulted in the desired value of the cost, and then look at the solution of the simultaneous formulation, which gives the model and its corresponding policy decision.
Let us consider the third question above, which is "Can our intuition about how much it will cost to solve a problem help us produce a better probabilistic model?" The regularization parameter can be interpreted to regulate the strength of our belief in the operational cost. If we have a strong belief in the cost to solve the problem, and if that belief is correct, this will guide the choice of regularization parameter, and will help with prediction. In many real scenarios, a practitioner or

domain expert might truly have a prior belief on the cost to complete a task. Arguably, a manager having this more grounded type of prior belief is much more natural than, for instance, the manager having a prior belief on the $\ell_2$ norm of the coefficients of a linear model, or the number of nonzero coefficients in the model. Being able to encode this type of prior belief on cost could potentially be helpful for prediction: as with other types of prior beliefs, it can help to restrict the hypothesis space and can assist with generalization. In this work, we show that the restricted hypothesis spaces resulting from our method can often be bounded by an intersection of an $\ell_q$ ball with a halfspace, and this is true for many different types of decision problems. We analyze the complexity of this type of hypothesis space with a technique based on Maurey's Lemma (Barron, 1993; Zhang, 2002) that leads eventually to a counting problem, where we calculate the number of integer points within a polyhedron in order to obtain a covering number bound.

The operational cost regularization term can be the optimal value of a complicated optimization problem, like a scheduling problem. This means we will need to solve an optimization problem each time we evaluate the learning algorithm's objective. However, the practitioner must be able to solve that problem anyway in order to develop a plan of action; it is the same problem they need to solve in the traditional sequential process, or using standard decision theory. Since the decision problem is solved only on data from the present, whose labels are not yet known, solving the decision problem may not be difficult, especially if the number of unlabeled examples is small. In that case, the method can still scale up to huge historical data sets, since the historical data factor into the training error term but not the new regularization term, and both terms can be computed. An example is to compute a schedule for a day, based on factors of the various meetings on the schedule that day. We can use a very large amount of past meeting-length data for the training error term, but then we use only the small set of possible meetings coming up that day to pass into the scheduling problem. In that case, both the training error term and the regularization term can be computed, and the objective can be minimized.

The simultaneous process is a type of decision theory. To give some background, there are two types of relevant decision theories: normative (which assumes full information, rationality and infinite computational power) and descriptive (which models realistic human behavior). Normative decision theories that address decision making under uncertainty can be classified into those based on ignorance (using no probabilistic information) and those based on risk (using probabilistic information). The former include the maximax, maximin (Wald), minimax regret (Savage), criterion of realism (Hurwicz), and equally likely (Laplace) approaches. The latter include utility-based expected value and Bayesian approaches (Savage). Info-gap, Dempster-Shafer, fuzzy logic, and possibility theories offer non-probabilistic alternatives to probability in Bayesian/expected value theories (French, 1986; Hansson, 1994). The simultaneous process does not fit into any of the decision theories listed above. For instance, a core idea in the Bayesian approach is to choose a single policy that maximizes expected utility, or minimizes expected cost. Our goal is not to find a single policy that is useful on average.
In contrast, our goal is to trace out a path of models, their specific (not average) optimal-response policies, and their costs. The policy from the Bayesian approach may not correspond to the best decision for any particular single model, whereas that is something we want in our case. We trace out this path by changing our prior belief on the operational cost (that is, by changing the strength of our regularization term). In Bayesian decision theory, the prior is over possible probabilistic models, rather than on possible costs as in this paper. Constructing this prior over possible probabilistic models can be challenging, and the prior often ends up being chosen arbitrarily, or as a matter of

convenience. In contrast, we assume only an unknown probability measure over the data, and the data itself defines the possible probabilistic models for which we compute policies.

Maximax (optimistic) and maximin (pessimistic) decision approaches contrast with the Bayesian framework and do not assume a distribution on the possible probabilistic models. In Section 4 we will discuss how these approaches are related to the simultaneous process. They overlap with the simultaneous process, but not completely. Robust optimization is a maximin approach to decision making, and the simultaneous process also differs in principle from robust optimization. In robust optimization, one would generally need to allocate much more than is necessary for any single realistic situation, in order to produce a policy that is robust to almost all situations. However, this is not always true; in fact, we show in this work that in some circumstances, while sweeping through the regularization parameter, one of the results produced by the simultaneous process is the same as the one coming from robust optimization.

We introduce the sequential and simultaneous processes in Section 2. In Section 3, we give several examples of algorithms that incorporate these operational costs. In doing so, we provide answers for the first two questions Q1 and Q2 above, with respect to specific problems. Our first example application is a staffing problem at a medical clinic, where the decision problem is to staff a set of stations that patients must complete in a certain order. The time required for patients to complete each station is random and estimated from past data. The second example is a real-estate purchasing problem, where the policy decision is to purchase a subset of available properties. The values of the properties need to be estimated from comparable sales. The third example is a call center staffing problem, where we need to create a staffing policy based on historical call arrival and service time information. A fourth example is the Machine Learning and Traveling Repairman Problem (ML&TRP), where the policy decision is a route for a repair crew.

As mentioned above, there is a large subset of problems that can be formulated using the simultaneous process that have a special property: they are equivalent to robust optimization (RO) problems. Section 4 discusses this relationship and provides, under specific conditions, the equivalence of the simultaneous process with RO. Robust optimization, when used for decision-making, does not usually include machine learning, nor any other type of statistical model, so we discuss how a statistical model can be incorporated within an uncertainty set for an RO. Specifically, we discuss how different loss functions from machine learning correspond to different uncertainty sets. We also discuss the overlap between RO and the optimistic and pessimistic versions of the simultaneous process.

We consider the implications of the simultaneous process on statistical learning theory in Section 5. In particular, we aim to understand how operational costs affect prediction (generalization) ability. This helps answer the third question Q3, about how intuition about operational cost can help produce a better probabilistic model. We show first that the hypothesis spaces for most of the applications in Section 3 can be bounded in a specific way, by an intersection of a ball and a halfspace, and this is true regardless of how complicated the constraints of the optimization problem are, and how different the operational costs are from each other in the different applications.
Second, we bound the complexity of this type of hypothesis space using a technique based on Maurey's Lemma (Barron, 1993; Zhang, 2002) that leads eventually to a counting problem, where we calculate the number of integer points within a polyhedron in order to obtain a generalization bound. Our results show that it is possible to make use of much more general structure in estimation problems, compared to the standard (norm-constrained) structures like sparsity and smoothness; further, this additional structure can benefit generalization

ability. A shorter version of this work has been previously published (see Tulabandhula and Rudin, 2012).

2. The Sequential and Simultaneous Processes

We have a training set of (random) labeled instances, $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in X$, $y_i \in Y$, that we will use to learn a function $f : X \to Y$. Commonly in machine learning this is done by choosing $f$ to be the solution of a minimization problem:

$$ f^* \in \operatorname*{argmin}_{f \in F^{unc}} \left( \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) \right), \qquad (1) $$

for some loss function $l : Y \times Y \to R_+$, regularizer $R : F^{unc} \to R$, constant $C_2$ and function class $F^{unc}$. Here, $Y \subseteq R$. Typical loss functions used in machine learning are the 0-1 loss, ramp loss, hinge loss, logistic loss and the exponential loss. Function class $F^{unc}$ is commonly the class of all linear functionals, where an element $f \in F^{unc}$ is of the form $\beta^T x$, with $X \subseteq R^p$, $\beta \in R^p$. We have used "unc" in the superscript for $F^{unc}$ to refer to the word "unconstrained," since it contains all linear functionals. Typical regularizers $R$ are the $\ell_1$ and $\ell_2$ norms of $\beta$. Note that nonlinearities can be incorporated into $F^{unc}$ by allowing nonlinear features, so that we now would have $f(x) = \sum_j \beta_j h_j(x)$, where $\{h_j\}_j$ is the set of features, which can be arbitrary nonlinear functions of $x$; for simplicity in notation, we will equate $h_j(x) = x_j$ and have $X \subseteq R^p$.

Consider an organization making policy decisions. Given a new collection of unlabeled instances $\{\tilde{x}_i\}_{i=1}^m$, the organization wants to create a policy $\pi^*$ that minimizes a certain operational cost $\mathrm{OCost}(\pi, f, \{\tilde{x}_i\}_i)$. Of course, if the organization knew the true labels for the $\tilde{x}_i$'s beforehand, it would choose a policy to optimize the operational cost based directly on these labels, and would not need $f$. Since the labels are not known, the operational costs are calculated using the model's predictions, the $f(\tilde{x}_i)$'s. The difference between the traditional sequential process and the new simultaneous process is whether $f$ is chosen with or without knowledge of the operational cost.

As an example, consider $\{\tilde{x}_i\}_i$ as representing machines in a factory waiting to be repaired, where the first feature $\tilde{x}_{i,1}$ is the age of the machine, the second feature $\tilde{x}_{i,2}$ is the condition at its last inspection, etc. The value $f(\tilde{x}_i)$ is the predicted probability of failure for $\tilde{x}_i$. Policy $\pi$ is the order in which the machines $\{\tilde{x}_i\}_i$ are repaired, which is chosen based on how likely they are to fail, that is, $\{f(\tilde{x}_i)\}_i$, and on the costs of the various types of repairs needed. The traditional sequential process picks a model $f^*$ based on past failure data, without knowledge of the operational cost, and afterwards computes $\pi^*$ based on an optimization problem involving the $f^*(\tilde{x}_i)$'s and the operational cost. The new simultaneous process picks $f^*$ and $\pi^*$ at the same time, based on optimism or pessimism about the operational cost of $\pi^*$.

Formally, the sequential process computes the policy according to two steps, as follows.

Step 1: Create function $f^*$ based on $\{(x_i, y_i)\}_i$ according to (1). That is,

$$ f^* \in \operatorname*{argmin}_{f \in F^{unc}} \left( \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) \right). $$

Step 2: Choose policy $\pi^*$ to minimize the operational cost,

$$ \pi^* \in \operatorname*{argmin}_{\pi \in \Pi} \mathrm{OCost}(\pi, f^*, \{\tilde{x}_i\}_i). $$
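To make the two steps concrete, here is a minimal sketch (not from the paper) of the sequential process, assuming squared loss with an $\ell_2$ regularizer in Step 1 and, for Step 2, a policy set small enough to enumerate, in the spirit of the factory repair-ordering example; the cost form, data, and enumeration are illustrative assumptions.

    import numpy as np
    from itertools import permutations

    def fit_model(X, y, C2):
        # Step 1: squared loss with l2 regularizer R(beta) = ||beta||_2^2
        # has the closed-form ridge solution.
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + C2 * np.eye(p), X.T @ y)

    def choose_policy(beta, X_new, ocost):
        # Step 2: minimize OCost over the policy set Pi, here all repair
        # orders for m machines.
        m = X_new.shape[0]
        preds = X_new @ beta                     # f*(x~_i) = beta^T x~_i
        return min(permutations(range(m)), key=lambda pi: ocost(pi, preds))

    # Illustrative cost: repairing machine j in slot t incurs waiting time t
    # weighted by its predicted failure likelihood (hypothetical cost form).
    ocost = lambda pi, preds: sum(t * preds[j] for t, j in enumerate(pi))

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
    beta_star = fit_model(X, y, C2=1.0)
    pi_star = choose_policy(beta_star, rng.normal(size=(4, 3)), ocost)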

The operational cost $\mathrm{OCost}(\pi, f, \{\tilde{x}_i\}_i)$ is the amount the organization will spend if policy $\pi$ is chosen in response to the values of $\{f(\tilde{x}_i)\}_i$.

To define the simultaneous process, we combine Steps 1 and 2 of the sequential process. We can choose an optimistic bias, where we prefer (all else being equal) a model providing lower costs, or we can choose a pessimistic bias that prefers higher costs, where the degree of optimism or pessimism is controlled by a parameter $C_1$. In other words, the optimistic bias lowers costs when there is uncertainty, whereas the pessimistic bias raises them. The new steps are as follows.

Step 1: Choose a model $f^*$ obeying one of the following:

Optimistic Bias:
$$ f^* \in \operatorname*{argmin}_{f \in F^{unc}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) + C_1 \min_{\pi \in \Pi} \mathrm{OCost}(\pi, f, \{\tilde{x}_i\}_i) \right], \qquad (2) $$

Pessimistic Bias:
$$ f^* \in \operatorname*{argmin}_{f \in F^{unc}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \mathrm{OCost}(\pi, f, \{\tilde{x}_i\}_i) \right]. \qquad (3) $$

Step 2: Compute the policy:
$$ \pi^* \in \operatorname*{argmin}_{\pi \in \Pi} \mathrm{OCost}(\pi, f^*, \{\tilde{x}_i\}_i). $$

When $C_1 = 0$, the simultaneous process becomes the sequential process; the sequential process is a special case of the simultaneous process. The optimization problem in the simultaneous process can be computationally difficult, particularly if the subproblem to minimize OCost involves discrete optimization. However, if the number of unlabeled instances is small, or if the policy decision can be broken into several smaller subproblems, then even if the training set is large, one can solve Step 1 using different types of mathematical programming solvers, including MINLP solvers (Bonami et al., 2008), Nelder-Mead (Nelder and Mead, 1965) and Alternating Minimization schemes (Tulabandhula et al., 2011). One needs to be able to solve instances of that optimization problem in any case for Step 2 of the sequential process. The simultaneous process is more intensive than the sequential process in that it requires repeated solutions of that optimization problem, rather than a single solution. The regularization term $R(f)$ can be, for example, an $\ell_1$ or $\ell_2$ regularization term to encourage a sparse or smooth solution.

As the $C_1$ coefficient swings between large values for the optimistic and pessimistic cases, the algorithm finds the best solution (having the lowest loss with respect to the data) for each possible cost. Once the regularization coefficient is too large, the algorithm will sacrifice empirical error in favor of lower costs, and will thus obtain solutions that are not reasonable. When that happens, we know we have already mapped out the full range of costs for reasonable solutions. This range can be used for pre-allocation decisions. By sweeping over a range of $C_1$, we obtain a range of costs that we might incur. Based on this range, we can choose to allocate a reasonable amount of resources so that we can react best to whatever nature brings. This helps answer question Q1 in Section 1.
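As an illustration of how such a sweep might be implemented, the sketch below evaluates the Step 1 objective (2)/(3) with a derivative-free method over a grid of $C_1$ values, assuming squared loss and $\ell_2$ regularization; the inner problem min_over_pi OCost is a toy stand-in, since the real one depends on the application.

    import numpy as np
    from scipy.optimize import minimize

    def step1_objective(beta, X, y, X_new, C1, C2, min_ocost, sign):
        # Training loss + C2*||beta||^2 + C1 * (min over policies of OCost);
        # sign = +1 gives the optimistic bias (2), sign = -1 the pessimistic (3).
        loss = np.sum((y - X @ beta) ** 2)
        return loss + C2 * (beta @ beta) + sign * C1 * min_ocost(X_new @ beta)

    def sweep_C1(X, y, X_new, C2, min_ocost, C1_grid, sign=+1):
        # Solve Step 1 for each C1 with a derivative-free method (the inner
        # policy problem can make the objective nonsmooth), then record the
        # Step 2 cost; the returned range of costs is what gets reported.
        beta0, costs = np.zeros(X.shape[1]), []
        for C1 in C1_grid:
            res = minimize(step1_objective, beta0, method="Nelder-Mead",
                           args=(X, y, X_new, C1, C2, min_ocost, sign))
            costs.append(min_ocost(X_new @ res.x))
            beta0 = res.x                        # warm start the next C1
        return costs

    # Toy inner problem (assumed form): cheapest of two fixed policies.
    min_ocost = lambda preds: min(preds.sum(), 2.0 * preds.clip(min=0).sum())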

In addition, we can pick a value of $C_1$ such that the resulting operational cost is a specific amount. In this case, we are checking whether a probabilistic model exists, corresponding to that cost, that is reasonably supported by the data. This can answer question Q2 in Section 1.

It is possible for the set of feasible policies $\Pi$ to depend on the recommendations $\{f(\tilde{x}_1), \ldots, f(\tilde{x}_m)\}$, so that $\Pi = \Pi(f, \{\tilde{x}_i\}_i)$ in general. We will revisit this possibility in Section 4. It is also possible for the optimization over $\pi \in \Pi$ to be trivial, or the optimization problem could have a closed-form solution. Our notation does accommodate this, and is more general.

One should not view the operational cost as a utility function that needs to be estimated, as in reinforcement learning, where we do not know the cost. Here one knows precisely what the cost will be under each possible outcome. Unlike in reinforcement learning, we have a complicated one-shot decision problem at hand, and have training data as well as future/unlabeled examples on which the predictive model makes predictions.

The use of unlabeled data $\{\tilde{x}_i\}_i$ has been explored widely in the machine learning literature under semi-supervised, transductive, and unsupervised learning. In particular, we point out that the simultaneous process is not a semi-supervised learning method (see Chapelle et al., 2006), since it does not use the unlabeled data to provide information about the underlying distribution. A small unlabeled sample is not very useful for semi-supervised learning, but could be very useful for constructing a low-cost policy. The simultaneous process also has a resemblance to transductive learning (see Zhu, 2007), whose goal is to produce the output labels on the set of unlabeled examples; in this case, we produce a function (namely the operational cost) applied to those output labels. The simultaneous process, for a fixed choice of $C_1$, can also be considered as a multi-objective machine learning method, since it involves an optimization problem having two terms with competing goals (see Jin, 2006).

2.1 The Simultaneous Process in the Context of Structural Risk Minimization

In the framework of statistical learning theory (e.g., Vapnik, 1998; Pollard, 1984; Anthony and Bartlett, 1999; Zhang, 2002), prediction ability of a class of models is guaranteed when the class has low complexity, where complexity is defined via covering numbers, VC (Vapnik-Chervonenkis) dimension, Rademacher complexity, Gaussian complexity, etc. Limiting the complexity of the hypothesis space imposes a bias, and the classical image associated with the bias-variance tradeoff is provided in Figure 1(a). The set of good models is indicated on the axis of the figure. Models that are not good are either overfitted (explaining too much of the variance of the data, having a high complexity), or underfitted (having too strong of a bias and a high empirical error). By understanding complexity, we can find a model class where both the training error and the complexity are kept low. An example of increasingly complex model classes is the set of nested classes of polynomials, starting with constants, then linear functions, second-order polynomials and so on.

In predictive modeling problems, there is often no one right statistical model when dealing with finite data sets; in fact, there may be a whole class of good models. In addition, it is possible that a small change in the choice of predictive model could lead to a large change in the cost required to implement the policy recommended by the model. This occurs, for instance, when costs are based on objects (e.g., products) that come in discrete amounts.
Figure 1(b) illustrates this possibility, by showing that there may be a variety of costs amongst the class of good models. The simultaneous process can find the range of costs for the set of good models, which can be used for allocation of costs, as discussed in the first question Q1 in the introduction.

Figure 1: In all three plots, the x-axis represents model classes with increasing complexity. a) Relationship between training error and test error as a function of model complexity. b) A possible operational cost as a function of model complexity. c) Another possible operational cost.

Recall that question Q3 asked if our intuition about how much it will cost to solve a problem can help us produce a better probabilistic model. Figure 1 can be used to illustrate how this question can be answered. Assume we have a strong prior belief that the operational cost will not be above a certain fixed amount. Accordingly, we will choose only amongst the class of low-cost models. This can significantly limit the complexity of the hypothesis space, because the set of low-cost good models might be much smaller than the full space of good models. Consider, for example, the cost displayed in Figure 1(c), where only models on the left part of the plot would be considered, since they are low-cost models. Because the hypothesis space is smaller, we may be able to produce a tighter

bound on the complexity of the hypothesis space, thereby obtaining a better prediction guarantee for the simultaneous process than for the sequential process. In Section 5 we develop results of this type. These results indicate that in some cases, the operational cost can be an important quantity for generalization.

3. Conceptual Demonstrations

We provide four examples. In the first, we estimate manpower requirements for a scheduling task. In the second, we estimate real estate prices for a purchasing decision. In the third, we estimate call arrival rates for a call center staffing problem. In the fourth, we estimate failure probabilities for manholes (access points to an underground electrical grid). The first two are small-scale reproducible examples, designed to demonstrate new types of constraints due to operational costs. In the first example, the operational cost subproblem involves scheduling. In the second, it is a knapsack problem, and in the third, it is another multidimensional knapsack variant. In the fourth, it is a routing problem. In the first, second and fourth examples, the operational cost leads to a linear constraint, while in the third example, the cost leads to a quadratic constraint.

Throughout this section, we will assume that we are working with linear functions $f$ of the form $\beta^T x$, so that $\Pi(f, \{\tilde{x}_i\}_i)$ can be denoted by $\Pi(\beta, \{\tilde{x}_i\}_i)$. We will set $R(f)$ to be equal to $\|\beta\|_2^2$. We will also use the notation $F^R$ to denote the set of linear functions that satisfy an additional property:
$$ F^R := \{ f \in F^{unc} : R(f) \le C_2^* \}, $$
where $C_2^*$ is a known constant greater than zero. We will use constant $C_2$ from (1), and also $C_2^*$ from the definition of $F^R$, to control the extent of regularization. $C_2$ is inversely related to $C_2^*$. We use both versions interchangeably throughout the paper.

3.1 Manpower Data and Scheduling with Precedence Constraints

We aim to schedule the starting times of medical staff, who work at 6 stations, for instance, ultrasound, X-ray, MRI, CT scan, nuclear imaging, and blood lab. Current and incoming patients need to go through some of these stations in a particular order. The six stations and the possible orders are shown in Figure 2. Each station is denoted by a line. Work starts at the check-in (at time $\pi_1$) and ends at the check-out (at time $\pi_5$). The stations are numbered 6-11, in order to avoid confusion with the times $\pi_1$-$\pi_5$. The clinic has precedence constraints, where a station cannot be staffed (or work with patients) until the preceding stations are likely to finish with their patients. For instance, the check-out should not start until all the previous stations finish. Also, as shown in Figure 2, station 11 should not start until stations 8 and 9 are complete at time $\pi_4$, and station 9 should not start until station 7 is complete at time $\pi_3$. Stations 8 and 10 should not start until station 6 is complete. (This is related to a similar problem called "planning with preference" posed by F. Malucelli, Politecnico di Milano.)

The operational goal is to minimize the total time of the clinic's operation, from when the check-in happens at time $\pi_1$ until the check-out happens at time $\pi_5$. We estimate the time it takes for each station to finish its job with the patients based on two variables: the new load of patients for the day at the station, and the number of current patients already present. The data are available as "manpower" in the R package bestglm, using the Hour, Load and Stay columns.
The training error is chosen to be the least squares loss between the estimated time for stations to finish their jobs

Figure 2: Staffing estimation with bias on scheduling with precedence constraints.

($\beta^T x_i$) and the actual times it took to finish ($y_i$). The unlabeled data are the new load and current patients present at each station for a given period, given as $\tilde{x}_6, \ldots, \tilde{x}_{11}$. Let $\pi$ denote the 5-dimensional real vector with coordinates $\pi_1, \ldots, \pi_5$. The operational cost is the total time $\pi_5 - \pi_1$. Step 1, with an optimistic bias, can be written as:

$$ \min_{\{\beta : \|\beta\|_2^2 \le C_2^*\}} \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \min_{\pi \in \Pi(\beta, \{\tilde{x}_i\}_i)} (\pi_5 - \pi_1), \qquad (4) $$

where the feasible set $\Pi(\beta, \{\tilde{x}_i\}_i)$ is defined by the following constraints:
$$ \pi_a + \beta^T \tilde{x}_i \le \pi_b, \quad (a, i, b) \in \{(1,6,2), (1,7,3), (2,8,4), (3,9,4), (2,10,5), (4,11,5)\}, $$
$$ \pi_a \ge 0 \quad \text{for } a = 1, \ldots, 5. $$

To solve (4) given values of $C_1$ and $C_2^*$, we used a function-evaluation-based scheme called Nelder-Mead (Nelder and Mead, 1965), where at every iterate of $\beta$, the subproblem for $\pi$ was solved to optimality (using Gurobi).[1] $C_2^*$ was chosen heuristically based on (1) and kept fixed for the experiment beforehand.

[1] Gurobi is the Gurobi Optimizer v3.0 from Gurobi Optimization, Inc. 2010.
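For reference, the inner subproblem of (4) is a small linear program. The sketch below solves it with SciPy's linprog as a stand-in for Gurobi, given the six station durations $\beta^T \tilde{x}_i$; the numeric durations are made up.

    import numpy as np
    from scipy.optimize import linprog

    def schedule_cost(durations):
        # Inner subproblem of (4): minimize pi_5 - pi_1 subject to the
        # precedence constraints pi_a + duration_i <= pi_b and pi >= 0,
        # with duration_i = beta^T x~_i for station i.
        prec = [(1, 6, 2), (1, 7, 3), (2, 8, 4), (3, 9, 4), (2, 10, 5), (4, 11, 5)]
        c = np.array([-1.0, 0.0, 0.0, 0.0, 1.0])      # objective pi_5 - pi_1
        A_ub, b_ub = [], []
        for a, i, b in prec:                           # pi_a - pi_b <= -duration_i
            row = np.zeros(5)
            row[a - 1], row[b - 1] = 1.0, -1.0
            A_ub.append(row)
            b_ub.append(-durations[i])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 5)
        return res.fun                                 # optimal pi_5 - pi_1

    # Made-up durations for stations 6..11.
    print(schedule_cost({6: 1.0, 7: 2.0, 8: 1.5, 9: 0.5, 10: 2.5, 11: 1.0}))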

Figure 3: Left: Operational cost vs $C_1$. Center: Penalized training loss vs $C_1$. Right: R-squared statistic. $C_1 = 0$ corresponds to the baseline, which is the sequential formulation.

Figure 3 shows the operational cost, training loss, and $r^2$ statistic[2] for various values of $C_1$. For $C_1$ values between 0 and 0.2, the operational cost varies substantially, by 16%. The $r^2$ values for both training and test vary much less, by 3.5%, where the best value happened to have the largest value of $C_1$. For small data sets, there is generally a variation between training and test: for this data split, there is a 3.16% difference in $r^2$ between training and test for plain least squares, and this is similar across various splits of the training and test data. This means that for the scheduling problem, there is a range of reasonable predictive models within about 3.5% of each other. What we learn from this, in terms of the three questions in the introduction, is that: 1) There is a wide range of possible costs within the range of reasonable optimistic models. 2) We have found a reasonable scenario, supported by data, where the cost is 16% lower than in the sequential case. 3) If we have a prior belief that the cost will be lower, the models that are more accurate are the ones with lower costs, and therefore we may not want to designate the full cost suggested by the sequential process. We can perhaps designate up to 16% less.

[2] If $\hat{y}_i$ are the predicted labels and $\bar{y}$ is the mean of $\{y_1, \ldots, y_n\}$, then the value of the $r^2$ statistic is defined as $1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$. Thus $r^2$ is an affine transformation of the sum of squares error. $r^2$ allows training and test accuracy to be measured on a comparable scale.

Connection to learning theory: In the experiment, we used the tradeoff parameter $C_1$ to provide a soft constraint. Considering instead the corresponding hard constraint $\min_\pi (\pi_5 - \pi_1) \le \alpha$, the total time must be at least the time for any of the three paths in Figure 2, and thus at least the average of them:

$$ \alpha \ \ge \ \min_{\pi \in \Pi(\beta, \{\tilde{x}_i\}_i)} (\pi_5 - \pi_1) \ \ge \ \max\{ (\tilde{x}_6 + \tilde{x}_{10})^T \beta, \ (\tilde{x}_6 + \tilde{x}_8 + \tilde{x}_{11})^T \beta, \ (\tilde{x}_7 + \tilde{x}_9 + \tilde{x}_{11})^T \beta \} \ \ge \ z^T \beta, \qquad (5) $$

where $z = \frac{1}{3}\left[ (\tilde{x}_6 + \tilde{x}_{10}) + (\tilde{x}_6 + \tilde{x}_8 + \tilde{x}_{11}) + (\tilde{x}_7 + \tilde{x}_9 + \tilde{x}_{11}) \right]$. The main result in Section 5, Theorem 6, is a learning theoretic guarantee in the presence of this kind of arbitrary linear constraint, $z^T \beta \le \alpha$.

3.2 Housing Prices and the Knapsack Problem

A developer will purchase 3 properties amongst the 6 that are currently for sale and, in addition, will remodel them. She wants to maximize the total value of the houses she picks (the value of a property is its purchase cost plus the fixed remodeling cost). The fixed remodeling costs for the 6 properties are denoted $\{c_i\}_{i=1}^6$. She estimates the purchase cost of each property from data regarding historical sales, in this case, from the Boston Housing data set (Bache and Lichman, 2013), which has 13 features. Let policy $\pi \in \{0,1\}^6$ be the 6-dimensional binary vector that indicates the properties she purchases. Also, $x_i$ represents the features of property $i$ in the training data and $\tilde{x}_i$ represents the features of a different property that is currently on sale. The training loss is chosen to be the sum of squares error between the estimated prices $\beta^T x_i$ and the true house prices $y_i$ for historical sales. The cost (in this case, total value) is the sum of the three property values plus the costs for repair work. A pessimistic bias on total value is chosen to motivate a min-max formulation. The resulting

(mixed-integer) program for Step 1 of the simultaneous process is:

$$ \min_{\{\beta : \beta \in R^{13}, \|\beta\|_2^2 \le C_2^*\}} \ \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \left[ \max_{\pi \in \{0,1\}^6} \sum_{i=1}^6 (\beta^T \tilde{x}_i + c_i) \pi_i \ \text{ subject to } \sum_{i=1}^6 \pi_i \le 3 \right]. \qquad (6) $$

Notice that the second term above is a 1-dimensional $\{0,1\}$ knapsack instance. Since the set of policies $\Pi$ does not depend on $\beta$, we can rewrite (6) in a cleaner way that was not possible directly with (4):

$$ \min_\beta \max_\pi \left[ \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \sum_{i=1}^6 (\beta^T \tilde{x}_i + c_i) \pi_i \right] $$
$$ \text{subject to } \ \beta \in \{\beta : \beta \in R^{13}, \|\beta\|_2^2 \le C_2^*\}, \quad \pi \in \left\{ \pi : \pi \in \{0,1\}^6, \ \sum_{i=1}^6 \pi_i \le 3 \right\}. \qquad (7) $$

To solve (7) with user-defined parameters $C_1$ and $C_2^*$, we use fminimax, available through Matlab's Optimization toolbox.[3]

[3] The version of the toolbox used is Version 5.1, Matlab R2010b, Mathworks, Inc.
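The inner maximization in (6)/(7) is simple enough to solve exactly: with only a cardinality constraint, the best policy takes the three largest (positive) estimated values. Here is a sketch of just this inner step, with assumed inputs (the paper solves the full min-max with fminimax):

    import numpy as np

    def inner_knapsack(beta, X_new, c):
        # Inner maximization of (6)/(7): pick at most 3 of the 6 properties
        # to maximize total value v_i = beta^T x~_i + c_i. With only a
        # cardinality constraint, take the three largest positive values.
        v = X_new @ beta + c
        top = np.argsort(v)[::-1][:3]          # indices of the 3 largest values
        pi = np.zeros(len(v), dtype=int)
        pi[top[v[top] > 0]] = 1                # skip negative-value properties
        return pi, float(v @ pi)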

For the training and unlabeled set we chose, there is a change in policy above and below $C_1 = 0.05$, where different properties are purchased. Figure 4 shows the operational cost (the predicted total value of the houses after remodeling), the training loss, and $r^2$ values for a range of $C_1$.

Figure 4: Left: Operational cost (total value) vs $C_1$. Center: Penalized training loss vs $C_1$. Right: R-squared statistic. $C_1 = 0$ corresponds to the baseline, which is the sequential formulation.

The training loss and $r^2$ values change by less than 3.5%, whereas the total value changes about 6.5%. We can again draw conclusions in terms of the questions in the introduction as follows. The pessimistic bias shows that even if the developer chose the best response policy to the prices, she might end up with the expected total value of the purchased properties on the order of 6.5% less if she is unlucky. Also, we can now produce a realistic model where the total value is 6.5% less. We can use this model to help her understand the uncertainty involved in her investment.

Before moving to the next application of the proposed framework, we provide a bound analogous to that of (5). Let us replace the soft constraint represented by the second term of (6) with a hard constraint and then obtain a lower bound:

$$ \alpha \ \ge \ \max_{\pi \in \{0,1\}^6, \ \sum_{i=1}^6 \pi_i \le 3} \ \sum_{i=1}^6 (\beta^T \tilde{x}_i) \pi_i \ \ge \ \sum_{i=1}^6 (\beta^T \tilde{x}_i) \bar{\pi}_i, \qquad (8) $$

where $\bar{\pi}$ is some feasible solution of the linear programming relaxation of this problem that also gives a lower objective value. For instance, picking $\bar{\pi}_i = 0.5$ for $i = 1, \ldots, 6$ is a valid lower bound, giving us a looser constraint. The constraint can be rewritten:
$$ \beta^T \left( \frac{1}{2} \sum_{i=1}^6 \tilde{x}_i \right) \le \alpha. $$
This is again a linear constraint on the function class parametrized by $\beta$, which we can use for the analysis in Section 5. Note that if all six properties were being purchased by the developer instead of three, the knapsack problem would have a trivial solution and the regularization term would be explicit (rather than implicit).

3.3 A Call Center's Workload Estimation and Staff Scheduling

A call center's management wants to come up with a per-half-hour schedule for the staff for a given day between 10am and 10pm. The staff on duty should be enough to meet the demand based on call arrival estimates $N(i)$, $i = 1, \ldots, 24$. The staff required will depend linearly on the demand per half-hour. The demand per half-hour in turn will be computed based on the Erlang C model (Aldor-Noiman et al., 2009), which is also known as the square-root staffing rule. This particular model relates the demand $D(i)$ to the call arrival rate $N(i)$ in the following manner: $D(i) = N(i) + c\sqrt{N(i)}$, where $c$ determines where on the QED (Quality Efficiency Driven) curve the center wants to operate. We make the simplifying assumptions that the service time for each customer is constant, and that the coefficient $c$ is 0. If we know the call arrival rate $N(i)$, we can calculate the staffing requirements during each half hour. If we do not know the call arrival rate, we can estimate it from past data, and make optimistic or pessimistic staffing allocations.

There are additional staffing constraints, as shown in Figure 5; namely, there are three sets of employees who work at the center such that: the first set can work only from 10am-3pm, the second can work from 1:30pm-6:30pm, and the third set works from 5pm-10pm. The operational cost is the total number of employees hired to work that day (times a constant, which is the amount each person is paid). The objective of the management is to reduce the number of staff on duty but at the same time maintain a certain quality and efficiency.

The call arrivals are modeled as a Poisson process (Aldor-Noiman et al., 2009). What previous studies (Brown et al., 2001) have discovered about this estimation problem is that the square root of the call arrival rate tends to behave as a linear function of several features, including: day of the week, time of the day, whether it is a holiday/irregular day, and whether it is close to the end of the billing cycle.
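A small helper illustrating the staffing rule just described; the function names are ours, and the paper's experiments take c = 0:

    import math

    def staffing_demand(arrival_rate, c=0.0):
        # Square-root staffing rule D(i) = N(i) + c * sqrt(N(i)); with c = 0,
        # as assumed in the paper, demand equals the arrival rate estimate.
        return arrival_rate + c * math.sqrt(arrival_rate)

    def demand_from_model(beta, x_new, c=0.0):
        # The model predicts sqrt(N(i)) as beta^T x~_i, so N(i) = (beta^T x~_i)^2.
        pred_sqrt_rate = sum(b * x for b, x in zip(beta, x_new))
        return staffing_demand(pred_sqrt_rate ** 2, c)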

Figure 5: The three shifts for the call center. The cells represent half-hour periods, and there are 24 periods per work day. Work starts at 10am and ends at 10pm.

Data for call arrivals and features were collected over a period of 10 months from mid-February 2004 to the end of December 2004 (this is the same data set as in Aldor-Noiman et al., 2009). After converting categorical variables into binary encodings (e.g., each of the 7 weekdays into 6 binary features), the number of features is 36, and we randomly split the data into a training set and test set (2764 instances for training; another 3308 for test).

We now formalize the optimization problem for the simultaneous process. Let policy $\pi \in Z_+^3$ be a size-three vector indicating the number of employees for each of the three shifts. The training loss is the sum of squares error between the estimated square root of the arrival rate $\beta^T x_i$ and the actual square root of the arrival rate $y_i := \sqrt{N(i)}$. The cost is proportional to the total number of employees signed up to work, $\sum_i \pi_i$. An optimistic bias on cost is chosen, so that the (mixed-integer) program for Step 1 is:

$$ \min_{\beta : \|\beta\|_2^2 \le C_2^*} \ \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \left[ \min_\pi \sum_{i=1}^3 \pi_i \ \text{ subject to } \ a_i^T \pi \ge (\beta^T \tilde{x}_i)^2 \ \text{for } i = 1, \ldots, 24, \ \pi \in Z_+^3 \right], \qquad (9) $$

where Figure 5 illustrates the matrix $A$, with the shaded cells containing entry 1 and 0 elsewhere. The notation $a_i$ indicates the $i$th row of $A$:
$$ a_i(j) = \begin{cases} 1 & \text{if staff } j \text{ can work in half-hour period } i \\ 0 & \text{otherwise.} \end{cases} $$

To solve (9), we first relax the $\ell_2$-norm constraint on $\beta$ by adding another term to the function evaluation, namely $C_2 \|\beta\|_2^2$. This way, we can use a function-evaluation-based scheme that works for unconstrained optimization problems. As in the manpower scheduling example, we used an implementation of the Nelder-Mead algorithm, where at each step, Gurobi was used to solve the mixed-integer subproblem for finding the policy.
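The inner problem of (9) is a small integer program: minimize the total staff $\mathbf{1}^T \pi$ subject to $A\pi \ge b$. The sketch below uses SciPy's milp (SciPy >= 1.9) as a stand-in for Gurobi, on a made-up toy instance:

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    def min_staff(A, b):
        # Inner subproblem of (9): minimize 1^T pi subject to A pi >= b with
        # pi a nonnegative integer vector (one entry per shift).
        n_shifts = A.shape[1]
        res = milp(c=np.ones(n_shifts),
                   constraints=LinearConstraint(A, lb=b),   # A pi >= b
                   integrality=np.ones(n_shifts),           # pi integer
                   bounds=Bounds(0, np.inf))                # pi >= 0
        return res.x, res.fun

    # Made-up toy instance: 4 half-hour periods, 2 overlapping shifts.
    A = np.array([[1, 0], [1, 1], [1, 1], [0, 1]], dtype=float)
    b = np.array([2.0, 5.0, 4.0, 3.0])                      # b_i = (beta^T x~_i)^2
    pi_star, total_staff = min_staff(A, b)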

Figure 6: Left: Operational cost vs $C_1$. Center: Penalized training loss vs $C_1$. Right: R-squared statistic. $C_1 = 0$ corresponds to the baseline, which is the sequential formulation.

Figure 6 shows the operational cost, the training loss, and $r^2$ values for a range of $C_1$. The training loss and $r^2$ values change only 1.6% and 3.9% respectively, whereas the operational cost changes about 9.2%. Similar to the previous two examples, we can again draw conclusions in terms of the questions in Section 1 as follows. The optimistic bias shows that the management might incur operational costs on the order of 9% less if they are lucky. Further, the simultaneous process produces a reasonable model where costs are about 9% less. If the management team believes they will be reasonably lucky, they can justify designating substantially less than the amount suggested by the traditional sequential process.

Let us now investigate the structure of the operational cost regularization term we have in (9). For convenience, let us stack the quantities $(\beta^T \tilde{x}_i)^2$ as a vector $b \in R^{24}$. Also let the boldface symbol $\mathbf{1}$ represent a vector of all ones. If we replace the soft constraint represented by the second term with a hard constraint having an upper bound $\alpha$, we get:

$$ \alpha \ \ge \ \min_{\pi \in Z_+^3; \ A\pi \ge b} \mathbf{1}^T \pi \ \overset{(*)}{\ge} \ \min_{\pi \in R_+^3; \ A\pi \ge b} \mathbf{1}^T \pi \ \overset{(**)}{=} \ \max_{w \in R_+^{24}; \ A^T w \le \mathbf{1}} \sum_{i=1}^{24} w_i (\beta^T \tilde{x}_i)^2 \ \overset{(***)}{\ge} \ \frac{1}{10} \sum_{i=1}^{24} (\beta^T \tilde{x}_i)^2. $$

Here $\alpha$ is related to the choice of $C_1$ and is fixed. $(*)$ represents an LP relaxation of the integer program, with $\pi$ now belonging to the positive orthant rather than the Cartesian product of sets of positive integers. $(**)$ is due to LP strong duality, and $(***)$ is by choosing an appropriate feasible dual variable. Specifically, we pick $w_i = \frac{1}{10}$ for $i = 1, \ldots, 24$, which is feasible because staff cannot work more than 10 half-hour shifts (or 5 hours). With the three inequalities, we now have a constraint on $\beta$ of the form:
$$ \sum_{i=1}^{24} (\beta^T \tilde{x}_i)^2 \le 10\alpha. $$
This is a quadratic form in $\beta$ and gives an ellipsoidal feasible set. We already had a simple ellipsoidal feasibility constraint while defining the minimization problem of (9), of the form $\|\beta\|_2^2 \le C_2^*$. Thus, we can see that our effective hypothesis set (the set of linear functionals satisfying these constraints)

has become smaller. This in turn affects generalization. We are investigating generalization bounds for this type of hypothesis set in separate ongoing work.

3.4 The Machine Learning and Traveling Repairman Problem (ML&TRP) (Tulabandhula et al., 2011)

Recently, power companies have been investing in intelligent proactive maintenance for the power grid, in order to enhance public safety and reliability of electrical service. For instance, New York City has implemented new inspection and repair programs for manholes, where a manhole is an access point to the underground electrical system. Electrical grids can be extremely large (there are on the order of 23,000-53,000 manholes in each borough of NYC), and parts of the underground distribution network in many cities can be as old as 130 years, dating from the time of Thomas Edison. Because of the difficulties in collecting and analyzing historical electrical grid data, electrical grid repair and maintenance has been performed reactively (fix it only when it breaks), until recently (Urbina, 2004). These new proactive maintenance programs open the door for machine learning to assist with smart grid maintenance.

Machine learning models have started to be used for proactive maintenance in NYC, where supervised ranking algorithms are used to rank the manholes in order of predicted susceptibility to failure (fires, explosions, smoke) so that the most vulnerable manholes can be prioritized (Rudin et al., 2010, 2012, 2011). The machine learning algorithms make reasonably accurate predictions of manhole vulnerability; however, they do not (nor would they, using any other prediction-only technique) take the cost of repairs into account when making the ranked lists. They do not know that it is unreasonable, for example, if a repair crew has to travel across the city and back again for each manhole inspection, losing important time in the process. The power company must solve an optimization problem to determine the best repair route, based on the machine learning model's output. We might wish to find a policy that is not only supported by the historical power grid data (that ranks more vulnerable manholes above less vulnerable ones), but also would give a better route for the repair crew. An algorithm that could find such a route would lead to an improvement in repair operations on NYC's power grid, other power grids across the world, and improvements in many different kinds of routing operations (delivery trucks, trains, airplanes).

The simultaneous process could be used to solve this problem, where the operational cost is the price to route the repair crew along a graph, and the probabilities of failure at each node in the graph must be estimated. We call this the machine learning and traveling repairman problem (ML&TRP), and in our ongoing work (Tulabandhula et al., 2011) we have developed several formulations for the ML&TRP. We demonstrated, using manholes from the Bronx region of NYC, that it is possible to obtain a much more practical route using the ML&TRP, by taking the cost of the route optimistically into account in the machine learning model. We showed also that from the routing problem, we can obtain a linear constraint on the hypothesis space, in order to apply the generalization analysis of Section 5 (and in order to address question Q3 of Section 1).

4. Connections to Robust Optimization

The goal of robust optimization (RO) is to provide the best possible policy that is acceptable under a wide range of situations.[4]
[4] For more on robust optimization, see http://en.wikipedia.org/wiki/Robust_optimization.

This is different from the simultaneous process, which aims to find the best policies and costs for specific situations. Note that it is not always desirable to have a policy that is robust to a wide range of situations; this is a question of whether to respond to every situation simultaneously or whether to understand the single worst situation that could reasonably occur (which is what the pessimistic simultaneous formulation handles). In general, robust optimization can be overly pessimistic, requiring us to allocate enough to handle all reasonable situations; it can be substantially more pessimistic than the pessimistic simultaneous process.

In robust optimization, if there are several real-valued parameters involved in the optimization problem, we might declare a reasonable range, called the uncertainty set, for each parameter (e.g., $a_1 \in [9,10]$, $a_2 \in [1,2]$). Using techniques of RO, we would minimize the largest possible operational cost that could arise from parameter settings in these ranges. Estimation is not usually involved in the study of robust optimization (with some exceptions, see Xu et al., 2009, who consider support vector machines). On the other hand, one could choose the uncertainty set according to a statistical model, which is how we will build a connection to RO. Here, we choose the uncertainty set to be the class of models that fit the data to within $\varepsilon$, according to some fitting criteria. The major goals of the field of RO include algorithms, geometry, and tractability in finding the best policy, whereas our work is not concerned with finding a robust policy, but with estimation, taking the policy into account. Tractability for us is not always a main concern, as we need to be able to solve the optimization problem even to use the sequential process. Using even a small optimization problem as the operational cost might have a large impact on the model and decision. If the unlabeled set is not too large, or if the policy optimization problem can be broken into smaller subproblems, there is no problem with tractability. An example where the policy optimization might be broken into smaller subproblems is when the policy involves routing several different vehicles, where each vehicle must visit part of the unlabeled set; in that case there is a small subproblem for each vehicle. On the other hand, even though the goals of the simultaneous process and RO are entirely different, there is a strong connection with respect to the formulations for the simultaneous process and RO, and a class of problems for which they are equivalent. We will explore this connection in this section.

There are other methods that consider uncertainty in optimization, though not via the lens of estimation and learning. In the simplest case, one can perform both local and global sensitivity analysis for linear programs to ascertain uncertainty in the optimal solution and objective, but these techniques generally only handle simple forms of uncertainty (Vanderbei, 2008). Our work is also related to stochastic programming, where the goal is to find a policy that is robust to almost all of the possible circumstances (rather than all of them), where there are random variables governing the parameters of the problem, with known distributions (Birge and Louveaux, 1997). Again, our goal is not to find a policy that is necessarily robust to (almost all of) the worst cases, and estimation is again not the primary concern for stochastic programming; rather, it is how to take known randomness into account when determining the policy.
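To make the contrast concrete before the formal treatment below, here is a toy numerical sketch (ours, not the paper's): the uncertainty set is the set of one-parameter least squares models within epsilon of the best fit, the cost is an assumed staffing-style function, and we compare the single robust policy against the worst best-response cost that a pessimistic simultaneous analysis would report.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(1, 2, size=30)
    y = 1.5 * x + rng.normal(scale=0.3, size=30)

    # Uncertainty set F_good: one-parameter models whose training loss is
    # within eps of the best fit (eps = 2.0 is an arbitrary choice).
    betas = np.linspace(0.5, 2.5, 401)
    loss = ((y[None, :] - betas[:, None] * x[None, :]) ** 2).sum(axis=1)
    good = betas[loss <= loss.min() + 2.0]

    # Assumed staffing-style cost of policy pi under model beta: pay for pi
    # units up front plus a penalty for unmet predicted demand beta * x_new.
    x_new = 1.7
    ocost = lambda pi, beta: pi + 3.0 * max(beta * x_new - pi, 0.0)
    policies = np.linspace(0.0, 6.0, 121)

    # Robust optimization: a single policy protecting against every good model.
    robust_cost = min(max(ocost(pi, b) for b in good) for pi in policies)

    # Pessimistic simultaneous view: the worst best-response cost over the
    # good models; by weak duality this never exceeds the robust cost.
    pessimistic_cost = max(min(ocost(pi, b) for pi in policies) for b in good)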
4.1 Equivalence Between RO and the Simultaneous Process in Some Cases

In this subsection we will formally introduce RO. In order to connect RO to estimation, we will define the uncertainty set for RO, denoted $\mathcal{F}_{\mathrm{good}}$, to be the set of models for which the average loss on the sample is within $\varepsilon$ of the lowest possible. Then we will present the equivalence relationship between RO and the simultaneous process, using a minimax theorem.

In Section 2, we had introduced the notation $\{(x_i, y_i)\}_i$ and $\{\tilde{x}_i\}_i$ for labeled and unlabeled data, respectively. We had also introduced the class $\mathcal{F}^{\mathrm{unc}}$ in which we were searching for a function $f$ by minimizing an objective of the form (1). The uncertainty set $\mathcal{F}_{\mathrm{good}}$ will turn out to be a subset of $\mathcal{F}^{\mathrm{unc}}$ that depends on $\{(x_i, y_i)\}_i$ and $f^*$ but not on $\{\tilde{x}_i\}_i$.

We start with plain (non-robust) optimization, using a general version of the vanilla sequential process. Let $f$ denote an element of the set $\mathcal{F}_{\mathrm{good}}$, where $f$ is pre-determined, known, and fixed. Let the optimization problem for the policy decision $\pi$ be defined by:

$$\min_{\pi \in \Pi(f;\{\tilde{x}_i\}_i)} \mathrm{OpCost}(\pi, f; \{\tilde{x}_i\}_i), \qquad \text{(Base problem)} \qquad (10)$$

where $\Pi(f;\{\tilde{x}_i\}_i)$ is the feasible set for the optimization problem. Note that this is a more general version of the sequential process than in Section 2, since we have allowed the constraint set $\Pi$ to be a function of both $f$ and $\{\tilde{x}_i\}_i$, whereas in (2) and (3), only the objective and not the constraint set can depend on $f$ and $\{\tilde{x}_i\}_i$. Allowing this more general version of $\Pi$ will allow us to relate (10) to RO more clearly, and will help us to specify the additional assumptions we need in order to show the equivalence relationship. Specifically, in Section 2, OpCost depends on $(f, \{\tilde{x}_i\}_i)$ but $\Pi$ does not, whereas in RO, generally $\Pi$ depends on $(f, \{\tilde{x}_i\}_i)$ but OpCost does not. The fact that OpCost does not need to depend on $f$ and $\{\tilde{x}_i\}_i$ is not a serious issue, since we can generally remove the dependence through auxiliary variables. For instance, if the problem is a minimization of the form (10), we can use an auxiliary variable, say $t$, to obtain an equivalent problem:

$$\min_{\pi, t} \; t \quad \text{such that} \quad \pi \in \Pi(f; \{\tilde{x}_i\}_i), \;\; \mathrm{OpCost}(\pi, f; \{\tilde{x}_i\}_i) \le t, \qquad \text{(Base problem reformulated)}$$

where the dependence on $(f, \{\tilde{x}_i\}_i)$ is present only in the (new) feasible set. Since we had assumed $f$ to be fixed, this is a deterministic optimization problem (convex, mixed-integer, nonlinear, etc.). Now, consider the case when $f$ is not known exactly but is only known to lie in the uncertainty set $\mathcal{F}_{\mathrm{good}}$. The robust counterpart to (10) can then be written as:

$$\min_{\pi \in \bigcap_{g \in \mathcal{F}_{\mathrm{good}}} \Pi(g;\{\tilde{x}_i\}_i)} \;\; \max_{f \in \mathcal{F}_{\mathrm{good}}} \mathrm{OpCost}(\pi, f; \{\tilde{x}_i\}_i), \qquad \text{(Robust counterpart)} \qquad (11)$$

where we obtain a robustly feasible solution that is guaranteed to remain feasible for all values of $f \in \mathcal{F}_{\mathrm{good}}$. In general, (11) is much harder to solve than (10) and is a topic of much interest in the robust optimization community. As we discussed earlier, there is no focus in (11) on estimation, but it is possible to embed an estimation problem within the description of the set $\mathcal{F}_{\mathrm{good}}$, which we now define formally. In Section 3, $\mathcal{F}_R$ (a subset of $\mathcal{F}^{\mathrm{unc}}$) was defined as the set of linear functionals with the property that $R(f) \le C_2^\flat$. That is, $\mathcal{F}_R = \{f : f \in \mathcal{F}^{\mathrm{unc}},\, R(f) \le C_2^\flat\}$. We define $\mathcal{F}_{\mathrm{good}}$ as a subset of $\mathcal{F}_R$ by adding an additional property:

$$\mathcal{F}_{\mathrm{good}} = \left\{ f : f \in \mathcal{F}_R, \; \sum_{i=1}^n l(f(x_i), y_i) \le \sum_{i=1}^n l(f^*(x_i), y_i) + \varepsilon \right\}, \qquad (12)$$

for some fixed positive real $\varepsilon$. In (12), again $f^*$ is a solution that minimizes the objective in (1) over $\mathcal{F}^{\mathrm{unc}}$. The right hand side of the inequality in (12) is thus constant, and we will henceforth denote it by the single quantity $C_1^\flat$.
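When $\mathcal{F}_{\mathrm{good}}$ is replaced by a finite set of candidate models (an assumption we make here purely for illustration; the set defined in (12) is a continuum), the robust counterpart (11) can be solved directly with the auxiliary-variable trick described above: minimize $t$ subject to $\mathrm{OpCost}(\pi, f_j) \le t$ for every candidate $f_j$. A minimal sketch, with made-up linear models and a hypothetical feasible set:

    # Robust counterpart over a finite stand-in for F_good (illustrative only).
    import numpy as np
    import cvxpy as cp

    F_good = np.array([[9.0, 1.0],        # made-up models, all assumed to fit
                       [10.0, 2.0],       # the training data to within epsilon
                       [9.5, 1.5]])
    pi = cp.Variable(2, nonneg=True)
    t = cp.Variable()                     # epigraph (auxiliary) variable
    constraints = [cp.sum(pi) >= 5.0]     # hypothetical feasible set Pi
    constraints += [F_good[j] @ pi <= t for j in range(len(F_good))]
    cp.Problem(cp.Minimize(t), constraints).solve()
    print(pi.value, t.value)              # worst-case-optimal policy and cost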

Substituting this definition of $\mathcal{F}_{\mathrm{good}}$ into (11), and further making an important assumption (denoted A1) that $\Pi$ is not a function of $(f, \{\tilde{x}_i\}_i)$, we get the following optimization problem:

$$\min_{\pi \in \Pi} \left[ \max_{\{f \in \mathcal{F}_R \,:\, \sum_{i=1}^n l(f(x_i), y_i) \le C_1^\flat\}} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right], \qquad \text{(Robust counterpart with assumptions)} \qquad (13)$$

where $C_1^\flat$ now controls the amount of uncertainty via the set $\mathcal{F}_{\mathrm{good}}$. Before we state the equivalence relationship, we restate the formulations for optimistic and pessimistic biases on operational cost in the simultaneous process from (2) and (3):

$$\min_{f \in \mathcal{F}^{\mathrm{unc}}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) + C_1 \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right] \qquad \text{(Simultaneous optimistic)},$$

$$\min_{f \in \mathcal{F}^{\mathrm{unc}}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right] \qquad \text{(Simultaneous pessimistic)}. \qquad (14)$$

Apart from the assumption A1 on the decision set $\Pi$ that we made in (13), we will also assume that $\mathcal{F}_{\mathrm{good}}$ defined in (12) is convex; this will be assumption A2. If we also assume that the objective OpCost satisfies some nice properties (A3), and that uncertainty is characterized via the set $\mathcal{F}_{\mathrm{good}}$, then we can show that the two problems, namely (14) and (13), are equivalent. Let $\equiv$ denote equivalence between two problems, meaning that a solution to one side translates into a solution of the other side for some parameter values $(C_1, C_1^\flat, C_2, C_2^\flat)$.

Proposition 1 Let $\Pi(f; \{\tilde{x}_i\}_i) = \Pi$ be compact, convex, and independent of the parameters $f$ and $\{\tilde{x}_i\}_i$ (assumption A1). Let $\{f \in \mathcal{F}_R : \sum_{i=1}^n l(f(x_i), y_i) \le C_1^\flat\}$ be convex (assumption A2). Let the cost (to be minimized) $\mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i)$ be concave continuous in $f$ and convex continuous in $\pi$ (assumption A3). Then the robust optimization problem (13) is equivalent to the pessimistic bias optimization problem (14). That is,

$$\min_{\pi \in \Pi} \left[ \max_{\{f \in \mathcal{F}_R \,:\, \sum_{i=1}^n l(f(x_i), y_i) \le C_1^\flat\}} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right] \;\equiv\; \min_{f \in \mathcal{F}^{\mathrm{unc}}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right].$$

Remark 2 That the equivalence applies to linear programs (LPs) is clear, because the objective is linear and the feasible set is generally a polyhedron, and is thus convex. For integer programs, the objective OpCost satisfies continuity, but the feasible set is typically not convex; hence, the result does not generally apply to integer programs. In other words, the requirement that the constraint set $\Pi$ be convex excludes integer programs.
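The parameter correspondence $(C_1, C_1^\flat, C_2, C_2^\flat)$ invoked above is ordinary Lagrangian (strong) duality between penalized and constrained formulations, and it is easy to check numerically. The sketch below, on made-up data (our illustration, not an experiment from this paper), fits a penalized least-squares model with weight $C_2$ and then verifies that the constrained problem whose budget $C_2^\flat$ is set to the penalized solution's own $R(f) = \|w\|^2$ recovers the same solution:

    # Numeric check of the penalty/constraint (Lagrangian) correspondence.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(30, 2)), rng.normal(size=30)
    C2 = 0.5

    # Penalized form: loss + C2 * R(f), with R(f) = ||w||^2.
    pen = minimize(lambda w: np.sum((X @ w - y) ** 2) + C2 * (w @ w),
                   x0=np.zeros(2))
    C2_flat = pen.x @ pen.x               # budget induced by the penalized fit

    # Constrained form: minimize loss subject to R(f) <= C2_flat.
    con = minimize(lambda w: np.sum((X @ w - y) ** 2), x0=np.zeros(2),
                   constraints=[{"type": "ineq",
                                 "fun": lambda w: C2_flat - w @ w}])
    print(pen.x, con.x)                   # agree up to solver tolerance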

To prove Proposition 1, we restate a well-known generalization of von Neumann's minimax theorem and some related definitions.

Definition 3 A linear topological space (also called a topological vector space) is a vector space over a topological field (typically, the real numbers with their standard topology) with a topology such that vector addition and scalar multiplication are continuous functions. For example, any normed vector space is a linear topological space. A function $h$ is upper semicontinuous at a point $p_0$ if for every $\varepsilon > 0$ there exists a neighborhood $U$ of $p_0$ such that $h(p) \le h(p_0) + \varepsilon$ for all $p \in U$. A function $h$ defined over a convex set is quasi-concave if for all $p, q$ and $\lambda \in [0,1]$ we have $h(\lambda p + (1-\lambda) q) \ge \min(h(p), h(q))$. Similar definitions follow for lower semicontinuity and quasi-convexity.

Theorem 4 (Sion's minimax theorem; Sion, 1958) Let $\Pi$ be a compact convex subset of a linear topological space and $\Xi$ be a convex subset of a linear topological space. Let $G(\pi, \xi)$ be a real function on $\Pi \times \Xi$ such that (i) $G(\pi, \cdot)$ is upper semicontinuous and quasi-concave on $\Xi$ for each $\pi \in \Pi$; and (ii) $G(\cdot, \xi)$ is lower semicontinuous and quasi-convex on $\Pi$ for each $\xi \in \Xi$. Then

$$\min_{\pi \in \Pi} \sup_{\xi \in \Xi} G(\pi, \xi) = \sup_{\xi \in \Xi} \min_{\pi \in \Pi} G(\pi, \xi).$$

(A toy numerical check of this equality is sketched at the end of this subsection.) We can now proceed to the proof of Proposition 1.

Proof (of Proposition 1) We start from the left hand side of the equivalence we want to prove:

$$\min_{\pi \in \Pi} \left[ \max_{\{f \in \mathcal{F}_R \,:\, \sum_{i=1}^n l(f(x_i), y_i) \le C_1^\flat\}} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right]$$

$$\overset{(a)}{\equiv} \max_{\{f \in \mathcal{F}_R \,:\, \sum_{i=1}^n l(f(x_i), y_i) \le C_1^\flat\}} \left[ \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right]$$

$$\overset{(b)}{\equiv} \max_{f \in \mathcal{F}^{\mathrm{unc}}} \left[ -\frac{1}{C_1}\left( \sum_{i=1}^n l(f(x_i), y_i) - C_1^\flat \right) - \frac{C_2}{C_1}\left( R(f) - C_2^\flat \right) + \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right]$$

$$\overset{(c)}{\equiv} \min_{f \in \mathcal{F}^{\mathrm{unc}}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right],$$

which is the right hand side of the logical equivalence in the statement of the theorem. In step (a) we applied Sion's minimax theorem (Theorem 4), whose hypotheses are satisfied because of the assumptions we made. In step (b), we picked Lagrange coefficients, namely $1/C_1$ and $C_2/C_1$, both of which are positive. In particular, $C_1^\flat$ and $C_1$, as well as $C_2^\flat$ and $C_2$, are related by the Lagrange relaxation equivalence (strong duality). In step (c), we multiplied the objective by $C_1$ throughout, pulled the negative sign to the front, removed the constant terms $C_1^\flat$ and $C_2 C_2^\flat$, used the observation that $\max_a g(a) = -\min_a (-g(a))$, and finally removed the negative sign in front, as this does not affect the equivalence.

The equivalence relationship of Proposition 1 shows that there is a problem class in which each instance can be viewed either as an RO problem or as an estimation problem with an operational cost bias. We can use ideas from RO to make the simultaneous process more general. Before doing so, we will characterize $\mathcal{F}_{\mathrm{good}}$ for several specific loss functions.
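As promised after Theorem 4, here is a toy numerical check of the minimax equality (our own illustration; the bilinear $G$ below is chosen only because it is convex and continuous in $\pi$ and concave and continuous in $\xi$, as the theorem requires):

    # Grid check of Sion's minimax equality for G(pi, xi) = (pi - 0.3) * xi
    # on Pi = Xi = [0, 1] (illustrative only).
    import numpy as np

    pis = np.linspace(0.0, 1.0, 1001)
    xis = np.linspace(0.0, 1.0, 1001)
    G = np.outer(pis - 0.3, xis)          # G[i, j] = (pis[i] - 0.3) * xis[j]
    print(G.max(axis=1).min(),            # min over pi of max over xi
          G.min(axis=0).max())            # max over xi of min over pi: both 0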