Gaussian Process Optimization with Mutual Information
Emile Contal (CMLA, UMR CNRS 8536, ENS Cachan, France), Vianney Perchet (LPMA, Université Paris Diderot, France), Nicolas Vayatis (CMLA, UMR CNRS 8536, ENS Cachan, France)

Abstract

In this paper, we analyze a generic algorithm scheme for sequential global optimization using Gaussian processes. The upper bounds we derive on the cumulative regret for this generic algorithm improve by an exponential factor the previously known bounds for algorithms like GP-UCB. We also introduce the novel Gaussian Process Mutual Information algorithm (GP-MI), which significantly improves further these upper bounds for the cumulative regret. We confirm the efficiency of this algorithm on synthetic and real tasks against the natural competitor, GP-UCB, and also the Expected Improvement heuristic.

1. Introduction

Stochastic optimization problems are encountered in numerous real-world domains including engineering design (Wang & Shan, 2007), finance (Ziemba & Vickson, 2006), natural sciences (Floudas & Pardalos, 2000), or in machine learning for selecting models by tuning the parameters of learning algorithms (Snoek et al., 2012). We aim at finding the input of a given system which optimizes the output (or reward). In this view, an iterative procedure uses the previously acquired measures to choose the next query predicted to be the most useful. The goal is to maximize the sum of the rewards received at each iteration, that is, to minimize the cumulative regret by balancing exploration (gathering information by favoring locations with high uncertainty) and exploitation (focusing on the optimum by favoring locations with high predicted reward). This optimization task becomes challenging when the dimension of the search space is high and the evaluations are noisy and expensive.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
Efficient algorithms have been studied to tackle this challenge, such as multi-armed bandits (Auer et al., 2002; Kleinberg, 2004; Bubeck et al., 2011; Audibert et al., 2011), active learning (Carpentier et al., 2011; Chen & Krause, 2013) or Bayesian optimization (Mockus, 1989; Grunewalder et al., 2010; Srinivas et al., 2010; de Freitas et al., 2012). The theoretical analysis of such optimization procedures requires some prior assumptions on the underlying function f. Modeling f as a function distributed from a Gaussian process (GP) enforces nearby locations to have close associated values, and allows one to control the general smoothness of f with high probability according to the kernel of the GP (Rasmussen & Williams, 2006). Our main contribution is twofold: we propose a generic algorithm scheme for Gaussian process optimization and we prove sharp upper bounds on its cumulative regret. The theoretical analysis has a direct impact on strategies built with the GP-UCB algorithm (Srinivas et al., 2010), such as (Krause & Ong, 2011; Desautels et al., 2012; Contal et al., 2013). We suggest an alternative policy which achieves an exponential speed-up with respect to the cumulative regret. We also introduce a novel algorithm, the Gaussian Process Mutual Information algorithm (GP-MI), which further improves the upper bounds for the cumulative regret from O(√(T (log T)^{d+1})) for GP-UCB, the current state of the art, to the spectacular O(√((log T)^{d+1})), where T is the number of iterations, d is the dimension of the input space and the kernel function is Gaussian.

The remainder of this article is organized as follows. We first introduce the setup and notations. We define the GP-MI algorithm in Section 2. Main results on the cumulative regret bounds are presented in Section 3. We then provide technical details in Section 4. We finally confirm the performance of GP-MI on real and synthetic tasks compared to the state of the art of GP optimization and some heuristics used in practice.
2. Gaussian process optimization and the GP-MI algorithm

2.1. Sequential optimization and cumulative regret

Let f : X → R, where X ⊂ R^d is a compact and convex set, be the unknown function modeling the system we want to optimize. We consider the problem of finding the maximum of f, denoted by f(x*) = max_{x ∈ X} f(x), via successive queries x_1, x_2, … ∈ X. At iteration T+1, the choice of the next query x_{T+1} depends on the previous noisy observations Y_T = {y_1, …, y_T} at locations X_T = {x_1, …, x_T}, where y_t = f(x_t) + ε_t for all t ≤ T, and the noise variables ε_1, …, ε_T are independently distributed as a Gaussian random variable N(0, σ²) with zero mean and variance σ². The efficiency of a policy and its ability to address the exploration/exploitation tradeoff is measured via the cumulative regret R_T, defined as the sum of the instantaneous regrets r_t, the gaps between the value of the maximum and the values at the sample locations: r_t = f(x*) − f(x_t) for t ≤ T, and R_T = Σ_{t=1}^T (f(x*) − f(x_t)). Our aim is to obtain upper bounds on the cumulative regret R_T with high probability.

2.2. The Gaussian process framework

Figure 1. One-dimensional Gaussian process inference of the posterior mean μ (blue line) and posterior deviation σ (half of the height of the gray envelope) with squared exponential kernel, based on four observations (blue crosses).

Prior assumption on f. In order to control the smoothness of the underlying function, we assume that f is sampled from a Gaussian process GP(m, k) with mean function m : X → R and kernel function k : X × X → R₊. We formalize in this manner the prior assumption that high local variations of f have low probability. The prior mean function is considered, without loss of generality, to be zero, as the kernel k can completely define the GP (Rasmussen & Williams, 2006). We consider the normalized and dimensionless framework introduced by (Srinivas et al., 2010) where the variance is assumed to be bounded, that is, k(x, x) ≤ 1 for all x ∈ X.

Bayesian inference.
At iteration T+1, given the previously observed noisy values Y_T at locations X_T, we use Bayesian inference to compute the current posterior distribution (Rasmussen & Williams, 2006), which is a GP of mean μ_{T+1} and variance σ²_{T+1}, given at any x ∈ X by:

μ_{T+1}(x) = k_T(x)^⊤ C_T^{−1} Y_T,   (1)

σ²_{T+1}(x) = k(x, x) − k_T(x)^⊤ C_T^{−1} k_T(x),   (2)

where k_T(x) = [k(x_t, x)]_{x_t ∈ X_T} is the vector of covariances between x and the query points at time T, and C_T = K_T + σ² I with K_T = [k(x_t, x_{t'})]_{x_t, x_{t'} ∈ X_T} the kernel matrix, σ² is the variance of the noise and I stands for the identity matrix. The Bayesian inference is illustrated in Figure 1 on a sample problem in dimension one, where the posteriors are based on four observations of a Gaussian process with squared exponential kernel. The height of the gray area represents two posterior standard deviations at each point.

2.3. The Gaussian Process Mutual Information algorithm

The Gaussian Process Mutual Information algorithm (GP-MI) is presented as Algorithm 1. The key statement is the choice of the query, x_t = argmax_{x ∈ X} μ_t(x) + φ_t(x). The exploitation ability of the procedure is driven by μ_t, while the exploration is governed by φ_t : X → R, which is an increasing function of σ²_t(x). The novelty in the GP-MI algorithm is that φ_t is empirically controlled by the amount of exploration that has already been done; that is, the more the algorithm has gathered information on f, the more it will focus on the optimum. In the GP-UCB algorithm from (Srinivas et al., 2010) the exploration coefficient is O(log t) and therefore tends to infinity. The parameter α in Algorithm 1 governs the tradeoff between precision and confidence, as shown in Theorem 2. The efficiency of the algorithm is robust to the choice of its value. We confirm this property empirically and provide further discussion on the calibration of α in Section 5.
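To make the posterior equations (1)–(2) and the GP-MI selection rule of Section 2.3 concrete, here is a minimal numerical sketch on a finite candidate grid. It assumes a squared exponential kernel with unit prior variance (so k(x, x) = 1, as in the normalized framework above); the function and variable names are ours, not the paper's.

```python
import numpy as np

def se_kernel(A, B, length_scale=0.2):
    """Squared exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def posterior(X, y, Xstar, noise_var=1e-2):
    """Posterior mean and variance at the candidates Xstar (Eq. 1-2):
    mu(x) = k(x)^T C^{-1} y,  s2(x) = k(x, x) - k(x)^T C^{-1} k(x)."""
    C = se_kernel(X, X) + noise_var * np.eye(len(X))
    Kxs = se_kernel(X, Xstar)              # shape (T, m)
    sol = np.linalg.solve(C, Kxs)          # C^{-1} k(x) for each candidate
    mu = sol.T @ y
    s2 = 1.0 - np.sum(Kxs * sol, axis=0)   # k(x, x) = 1 in the normalized framework
    return mu, np.maximum(s2, 0.0)

def gp_mi_step(X, y, grid, gamma_hat, delta=1e-6, noise_var=1e-2):
    """One GP-MI selection: x_t = argmax mu_t(x) + phi_t(x), with
    phi_t(x) = sqrt(alpha) * (sqrt(s2(x) + gamma_hat) - sqrt(gamma_hat)),
    where gamma_hat is the running sum of posterior variances at past queries."""
    alpha = np.log(2.0 / delta)
    mu, s2 = posterior(X, y, grid, noise_var)
    phi = np.sqrt(alpha) * (np.sqrt(s2 + gamma_hat) - np.sqrt(gamma_hat))
    i = int(np.argmax(mu + phi))
    return i, gamma_hat + s2[i]            # selected index and updated gamma_hat
```

In a full optimization loop, one would start with gamma_hat = 0, call `gp_mi_step` at each iteration, query the selected grid point, and append the noisy observation to `X` and `y` before the next step.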
Algorithm 1 GP-MI (δ)
  γ̂_0 ← 0
  for t = 1, 2, … do
    // Bayesian inference (Eq. 1, 2)
    Compute μ_t and σ²_t
    // Definition of φ_t(x) for all x ∈ X
    φ_t(x) ← √α (√(σ²_t(x) + γ̂_{t−1}) − √γ̂_{t−1})
    // Selection of the next query location
    x_t ← argmax_{x ∈ X} μ_t(x) + φ_t(x)
    // Update γ̂_t
    γ̂_t ← γ̂_{t−1} + σ²_t(x_t)
    // Query
    Sample x_t and observe y_t
  end for

Mutual information. The quantity γ̂_T controlling the exploration in our algorithm forms a lower bound on the information acquired on f by the query points X_T. The information on f is formally defined by I_T(X_T), the mutual information between f and the noisy observations Y_T at X_T, hence the name of the GP-MI algorithm. For a Gaussian process distribution, I_T(X_T) = ½ log det(I + σ^{−2} K_T), where K_T is the kernel matrix [k(x_i, x_j)]_{x_i, x_j ∈ X_T}. We refer to (Cover & Thomas, 1991) for further reading on mutual information. We denote by γ_T = max_{X ⊂ X : |X| = T} I_T(X) the maximum mutual information obtainable by a sequence of T query points. In the case of Gaussian processes with bounded variance, the following inequality is satisfied (Lemma 5.4 in (Srinivas et al., 2010)):

γ̂_T = Σ_{t=1}^T σ²_t(x_t) ≤ C_1 γ_T,   (3)

where C_1 = 2/log(1 + σ^{−2}) and σ² is the noise variance. The upper bounds on the cumulative regret we derive in the next section depend mainly on this key quantity.

3. Main Results

3.1. Generic Optimization Scheme

We first consider the generic optimization scheme defined in Algorithm 2, where we leave φ_t as a generic function viewed as a parameter of the algorithm. We only require φ_t to be measurable with respect to Y_{t−1}, the observations available at iteration t. The theoretical analysis of Algorithm 2 can be used as a plug-in theorem for existing algorithms. For example, the GP-UCB algorithm with parameter β_t = O(log t) is obtained with φ_t(x) = √β_t σ_t(x). A generic analysis of Algorithm 2 leads to the following upper bounds on the cumulative regret with high probability.

Algorithm 2 Generic Optimization Scheme (φ_t)
  for t = 1, 2, …
do
    Compute μ_t and φ_t
    x_t ← argmax_{x ∈ X} μ_t(x) + φ_t(x)
    Sample x_t and observe y_t
  end for

Theorem 1 (Regret bounds for the generic algorithm). For all δ > 0 and T > 0, the regret R_T incurred by Algorithm 2 on f distributed as a GP perturbed by independent Gaussian noise with variance σ² satisfies the following bound with high probability, with C_1 = 2/log(1 + σ^{−2}) and α = log(2/δ):

Pr[ R_T ≤ Σ_{t=1}^T (φ_t(x_t) − φ_t(x*)) + 4√(α(C_1 γ_T + 1)) + √α Σ_{t=1}^T σ²_t(x*) / (2√(C_1 γ_T + 1)) ] ≥ 1 − δ.

The proof of Theorem 1 relies on concentration guarantees for Gaussian processes (see Section 4.1). Theorem 1 provides an intermediate result used for the calibration of φ_t to face the exploration/exploitation tradeoff. For example, by choosing φ_t(x) = √α σ²_t(x) (where the dimensional constant is hidden), Algorithm 2 becomes a variant of the GP-UCB algorithm where, in particular, the exploration parameter √β_t is fixed to √α instead of being an increasing function of t. The upper bounds on the cumulative regret with this definition of φ_t are of the form R_T = O(γ_T), as stated in Corollary 1. We then also consider the case where the kernel k of the Gaussian process is of the form of a squared exponential (RBF) kernel, k(x, x') = exp(−‖x − x'‖² / (2l²)) for all x, x' ∈ X and length scale l ∈ R. In this setting, the maximum mutual information γ_T satisfies the upper bound γ_T = O((log T)^{d+1}), where d is the dimension of the input space (Srinivas et al., 2012).

Corollary 1. Consider Algorithm 2 where we set φ_t(x) = √α σ²_t(x). Under the assumptions of Theorem 1, the cumulative regret for Algorithm 2 satisfies the following upper bounds with high probability:

- For f sampled from a GP with general kernel: R_T = O(γ_T).
- For f sampled from a GP with RBF kernel: R_T = O((log T)^{d+1}).

To prove Corollary 1, we apply Theorem 1 with the given definition of φ_t and then Equation 3, which leads to Pr[ R_T ≤ √α C_1 γ_T + 4√(α(C_1 γ_T + 1)) ] ≥ 1 − δ. The
previously known upper bounds on the cumulative regret for the GP-UCB algorithm are of the form R_T = O(√(T β_T γ_T)), where β_T = O(log(T/δ)). The improvement of the generic Algorithm 2 with φ_t(x) = √α σ²_t(x) over the GP-UCB algorithm with respect to the cumulative regret is then exponential in the case of Gaussian processes with RBF kernel. For f sampled from a GP with linear kernel, corresponding to f(x) = w^⊤ x with w ∼ N(0, I), we obtain R_T = O(d log T). We remark that the GP assumption with linear kernel is more restrictive than the linear bandit framework, as it implies a Gaussian prior over the linear coefficients w. Hence there is no contradiction with the lower bounds stated for linear bandits like those of (Dani et al., 2008). We refer to (Srinivas et al., 2012) for the analysis of γ_T with other kernels widely used in practice.

3.2. Regret bounds for the GP-MI algorithm

We present here the main result of the paper, the upper bound on the cumulative regret for the GP-MI algorithm.

Theorem 2 (Regret bounds for the GP-MI algorithm). For all δ > 0 and T > 0, the regret R_T incurred by Algorithm 1 on f distributed as a GP perturbed by independent Gaussian noise with variance σ² satisfies the following bound with high probability, with C_1 = 2/log(1 + σ^{−2}) and α = log(2/δ):

Pr[ R_T ≤ 5√(α C_1 γ_T) + 4√α ] ≥ 1 − δ.

The proof of Theorem 2 is provided in Section 4.2, where we analyze the properties of the exploration functions φ_t. Corollary 2 describes the case of the RBF kernel for the GP-MI algorithm.

Corollary 2 (RBF kernels). The cumulative regret R_T incurred by Algorithm 1 on f sampled from a GP with RBF kernel satisfies, with high probability, R_T = O(√((log T)^{d+1})).

The GP-MI algorithm significantly improves the upper bounds for the cumulative regret over the GP-UCB algorithm and the alternative policy of Corollary 1.

4. Theoretical analysis

In this section, we provide the proofs for Theorem 1 and Theorem 2. The approach presented here to study the cumulative regret incurred by Gaussian process optimization strategies is general and can be used further for other algorithms.

4.1.
Analysis of the general algorithm

The theoretical analysis of Theorem 1 uses an approach similar to the Azuma-Hoeffding inequality, adapted for Gaussian processes. Let r_t = f(x*) − f(x_t) for all t ≤ T. We define M_T, which is shown later to be a martingale with respect to Y_T:

M_T = Σ_{t=1}^T [ r_t − (μ_t(x*) − μ_t(x_t)) ],   (4)

for T ≥ 1 and M_0 = 0. Let Y_t be defined as the martingale difference sequence with respect to M_t, that is, the difference between the instantaneous regret and the gap between the posterior mean for the optimum and the one for the point queried:

Y_t = M_t − M_{t−1} = r_t − (μ_t(x*) − μ_t(x_t)) for t ≤ T.

Lemma 1. The sequence M_T is a martingale with respect to Y_T and, for all t ≤ T, given Y_{t−1}, the random variable Y_t is distributed as a Gaussian N(0, ℓ²_t) with zero mean and variance ℓ²_t, where:

ℓ²_t = σ²_t(x*) + σ²_t(x_t) − 2k_t(x*, x_t).   (5)

Proof. From the GP assumption, we know that given Y_{t−1}, the distribution of f(x) is Gaussian N(μ_t(x), σ²_t(x)) for all x ∈ X, and r_t is a projection of a Gaussian random vector; that is, r_t is distributed as a Gaussian N(μ_t(x*) − μ_t(x_t), ℓ²_t) and Y_t is distributed as a Gaussian N(0, ℓ²_t), with ℓ²_t = σ²_t(x*) + σ²_t(x_t) − 2k_t(x*, x_t), where k_t denotes the posterior covariance; hence M_T is a Gaussian martingale.

We now give a concentration result for M_T using inequalities for self-normalized martingales.

Lemma 2. For all δ > 0 and T > 0, the martingale M_T normalized by the predictable quadratic variation ⟨M⟩_T = Σ_{t=1}^T ℓ²_t satisfies the following concentration inequality, with α = log(2/δ) and y_T = 8(C_1 γ_T + 1):

Pr[ M_T ≥ √(2α y_T) + √(8α/y_T) Σ_{t=1}^T σ²_t(x*) ] ≤ δ.

Proof. Let y_T = 8(C_1 γ_T + 1). We introduce the notation Pr_>[A] for Pr[A, ⟨M⟩_T > y_T] and Pr_≤[A] for Pr[A, ⟨M⟩_T ≤ y_T]. Given that M_T is a Gaussian martingale, using Theorem 4.1 and Remark 4.2 from (Bercu & Touati, 2008) we obtain for all x > 0:

Pr_>[ M_T / ⟨M⟩_T > x ] < exp(−x² y_T / 2).

With x = √(2α/y_T), where α = log(2/δ), we have:

Pr_>[ M_T > √(2α/y_T) ⟨M⟩_T ] < δ/2.
By definition of ℓ²_t in Eq. 5 and since k_t(x*, x_t) ≤ σ_t(x*) σ_t(x_t), we have for all t ≤ T that ℓ²_t ≤ 2σ²_t(x_t) + 2σ²_t(x*). Using Equation 3, we have 2 Σ_{t=1}^T σ²_t(x_t) = 2γ̂_T ≤ y_T/4, and we finally get:

Pr_>[ M_T > √(α y_T / 8) + √(8α/y_T) Σ_{t=1}^T σ²_t(x*) ] < δ/2.   (6)

Now, using Theorem 4.1 and Remark 4.2 from (Bercu & Touati, 2008) again, the following inequality is satisfied for all x > 0:

Pr_≤[ M_T > x ] < exp(−x² / (2 y_T)).

With x = √(2α y_T) we have:

Pr_≤[ M_T > √(2α y_T) ] < δ/2.   (7)

Combining Equations 6 and 7 leads to

Pr[ M_T > √(2α y_T) + √(8α/y_T) Σ_{t=1}^T σ²_t(x*) ] < δ,

proving Lemma 2.

The following lemma concludes the proof of Theorem 1, using the previous concentration result and the properties of the generic Algorithm 2.

Lemma 3. The cumulative regret for Algorithm 2 on f sampled from a GP satisfies the following bound for all δ > 0, with α and y_T defined in Lemma 2:

Pr[ R_T ≤ Σ_{t=1}^T (φ_t(x_t) − φ_t(x*)) + √(2α y_T) + √(8α/y_T) Σ_{t=1}^T σ²_t(x*) ] ≥ 1 − δ.

Proof. By construction of the generic Algorithm 2, we have x_t = argmax_{x ∈ X} μ_t(x) + φ_t(x), which guarantees for all t ≤ T that μ_t(x*) − μ_t(x_t) ≤ φ_t(x_t) − φ_t(x*). Replacing M_T by its definition in Eq. 4 and using this property in Lemma 2 proves Lemma 3. Theorem 1 follows, since √(2α y_T) = 4√(α(C_1 γ_T + 1)) and √(8α/y_T) ≤ √α / (2√(C_1 γ_T + 1)).

4.2. Analysis of the GP-MI algorithm

In order to bound the cumulative regret for the GP-MI algorithm, we focus on an alternative definition of the exploration functions φ_t, where the last term is modified inductively so as to simplify the sum Σ_{t=1}^T φ_t(x_t) for all T > 0. Being a constant term for a fixed t, Algorithm 1 remains unchanged. Let φ_t be defined as:

φ_t(x) = √(α(σ²_t(x) + γ̂_{t−1})) − Σ_{i=1}^{t−1} φ_i(x_i),

where x_t is the point selected by Algorithm 1 at iteration t. Since φ_T(x_T) = √(α γ̂_T) − Σ_{t=1}^{T−1} φ_t(x_t), we have for all T > 0:

Σ_{t=1}^T φ_t(x_t) = √(α γ̂_T).   (8)

We can now derive upper bounds for Σ_{t=1}^T (φ_t(x_t) − φ_t(x*)), which will be plugged into Theorem 1 in order to cancel out the terms involving x*. In this manner we can calibrate sharply the exploration/exploitation tradeoff by optimizing the remaining terms.

Lemma 4. For the GP-MI algorithm, the exploration term in the equation of Theorem 1 satisfies the following inequality:

Σ_{t=1}^T (φ_t(x_t) − φ_t(x*)) ≤ √(α γ̂_T) − √α Σ_{t=1}^T σ²_t(x*) / (2√(γ̂_T + 1)).

Proof.
Using our alternative definition of φ_t, which gives the equality stated in Equation 8, we know that

Σ_{t=1}^T (φ_t(x_t) − φ_t(x*)) = √(α γ̂_T) − √α Σ_{t=1}^T ( √(γ̂_{t−1} + σ²_t(x*)) − √γ̂_{t−1} ).

By concavity of the square root, we have √a − √(a − b) ≥ b / (2√a) for all 0 ≤ b ≤ a. Introducing the notations a_t = γ̂_{t−1} + σ²_t(x*) and b_t = σ²_t(x*), we obtain

Σ_{t=1}^T (φ_t(x_t) − φ_t(x*)) ≤ √(α γ̂_T) − √α Σ_{t=1}^T b_t / (2√a_t).

Moreover, with σ²_t(x) ≤ 1 for all x ∈ X, we have a_t ≤ γ̂_T + 1 for all t ≤ T, which gives

b_t / (2√a_t) ≥ σ²_t(x*) / (2√(γ̂_T + 1)),

leading to the inequality of Lemma 4.

The following lemma combines the results from Theorem 1 and Lemma 4 to derive upper bounds on the cumulative regret for the GP-MI algorithm with high probability.

Lemma 5. The cumulative regret for Algorithm 1 on f sampled from a GP satisfies the following bound for all δ > 0, with α defined in Lemma 2:

Pr[ R_T ≤ 5√(α C_1 γ_T) + 4√α ] ≥ 1 − δ.
Proof. Considering Theorem 1 in the case of the GP-MI algorithm and bounding Σ_{t=1}^T (φ_t(x_t) − φ_t(x*)) with Lemma 4, we obtain the following bound on the cumulative regret incurred by GP-MI:

Pr[ R_T ≤ √(α γ̂_T) + 4√(α(C_1 γ_T + 1)) − √α Σ_{t=1}^T σ²_t(x*) / (2√(γ̂_T + 1)) + √α Σ_{t=1}^T σ²_t(x*) / (2√(C_1 γ_T + 1)) ] ≥ 1 − δ,

which simplifies to the inequality of Lemma 5 using Equation 3 (the terms in σ²_t(x*) cancel out since γ̂_T ≤ C_1 γ_T, and √(α(C_1 γ_T + 1)) ≤ √(α C_1 γ_T) + √α), and thus proves Theorem 2.

5. Practical considerations and experiments

5.1. Numerical experiments

Protocol. We compare the empirical performance of our algorithm against the state of the art of GP optimization, the GP-UCB algorithm (Srinivas et al., 2010), and a commonly used heuristic, the Expected Improvement (EI) algorithm with GP (Jones et al., 1998). The tasks used for assessment come from two real applications and five synthetic problems described here. For all data sets and algorithms, the learners were initialized with a random subset of observations {(x_i, y_i)}. When the prior distribution of the underlying function was not known, the Bayesian inference was made using a squared exponential kernel. We first picked half of the data set to estimate the hyperparameters of the kernel via cross-validation in this subset. In this way, each algorithm was running with the same prior information. The value of the parameter δ for the GP-MI and GP-UCB algorithms was fixed to δ = 10^{−6} for all these experimental tasks. Modifying this value by several orders of magnitude is insignificant with respect to the empirical mean cumulative regret incurred by the algorithms, as discussed in Section 5.2. The results are provided in Figure 3. The curves show the evolution of the average regret R_T/T in terms of the iteration T. We report the mean value with the confidence interval over a hundred experiments.

Description of the data sets. We briefly describe all the data sets used for assessment.

Generated GP.
The generated Gaussian process functions are random GPs drawn from an isotropic Matérn kernel in dimension 2 and in dimension 4, with the kernel bandwidth chosen for each dimension. The Matérn parameter was set to ν = 3 and the noise standard deviation to a fixed percentage of the signal standard deviation.

Gaussian mixture. This synthetic function comes from the addition of three 2D Gaussian functions.

Figure 2. Visualization of the synthetic functions used for assessment: (a) Gaussian mixture, (b) Himmelblau.

We then perturb these Gaussian functions with smooth variations generated from a Gaussian process with isotropic Matérn kernel and a small amount of observation noise. It is shown in Figure 2(a). The highest peak being thin, the sequential search for the maximum of this function is quite challenging.

Himmelblau. This task is another synthetic function in dimension 2. We compute a slightly tilted version of the Himmelblau function with the addition of a linear function, and take the opposite to match the challenge of finding its maximum. This function presents four peaks but only one global maximum. It gives a practical way to test the ability of a strategy to manage the exploration/exploitation tradeoff. It is represented in Figure 2(b).
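For reference, the standard Himmelblau surface can be written down directly; below is a sketch of the negated, linearly tilted variant described above. The tilt coefficients are illustrative placeholders, not the values used in the experiments.

```python
def himmelblau_objective(x, y, tilt=(0.1, 0.1)):
    """Negated Himmelblau function with an optional linear tilt.

    The standard form (x^2 + y - 11)^2 + (x + y^2 - 7)^2 has four global
    minima of value 0; negating it turns them into four peaks, and a small
    linear tilt makes exactly one of them the global maximum.
    """
    h = (x ** 2 + y - 11.0) ** 2 + (x + y ** 2 - 7.0) ** 2
    return -h + tilt[0] * x + tilt[1] * y
```

With `tilt=(0, 0)` the four peaks are exactly level at 0; any small nonzero tilt breaks the tie, which is what makes the task a good test of the exploration/exploitation tradeoff.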
Branin. The Branin, or Branin-Hoo, function is a common benchmark function for global optimization. It presents three global optima in the 2D square [−5, 10] × [0, 15]. This benchmark is one of the two synthetic functions used by (Srinivas et al., 2010) to evaluate the empirical performance of the GP-UCB algorithm. No noise has been added to the original signal in this experimental task.

Goldstein-Price. The Goldstein & Price function is another benchmark function for global optimization, with a single global optimum but several local optima in the 2D square [−2, 2] × [−2, 2]. This is the second synthetic benchmark used by (Srinivas et al., 2010). As in the previous challenge, no noise has been added to the original signal.

Tsunamis. Recent post-tsunami survey data, as well as the numerical simulations of (Hill et al., 2012), have shown that in some cases the runup, which is the maximum vertical extent of wave climbing on a beach, in areas which were supposed to be protected by small islands in the vicinity of the coast, was significantly higher than in neighboring locations. Motivated by these observations, (Stefanakis et al., 2012) investigated this phenomenon by employing numerical simulations using the VOLNA code (Dutykh et al., 2011) with the simplified geometry of a conical island sitting on a flat surface in front of a sloping beach. In the study of (Stefanakis et al., 2013) the setup was controlled by five physical parameters and the aim was to find, with confidence and with the least number of simulations, the parameters leading to the maximum runup amplification.

Mackey-Glass function. The Mackey-Glass delay-differential equation is a chaotic system in dimension 6, but without noise. It models real feedback systems and is used in physiological domains such as hematology, cardiology, neurology, and psychiatry. The highly chaotic behavior of this function makes it an exceptionally difficult optimization problem. It has been used as a benchmark, for example, by (Flake & Lawrence, 2002).

Empirical comparison of the algorithms.
Figure 3 compares the empirical mean average regret R_T/T for the three algorithms. On the easy optimization assessments, like the Branin data set (Fig. 3(e)), the three strategies behave in a comparable manner, but the GP-UCB algorithm incurs a larger cumulative regret. For more difficult assessments the GP-UCB algorithm performs poorly, and our algorithm always surpasses the EI heuristic. The improvement of the GP-MI algorithm against the two competitors is the most significant for exceptionally challenging optimization tasks, as illustrated in Figures 3(a) to 3(d) and 3(h), where the underlying functions present several local optima.

Figure 3. Empirical mean and confidence interval of the average regret R_T/T in terms of the iteration T on real and synthetic tasks for the GP-MI and GP-UCB algorithms and the EI heuristic (lower is better). Panels: (a) Generated GP (d = 2), (b) Generated GP (d = 4), (c) Gaussian mixture, (d) Himmelblau, (e) Branin, (f) Goldstein-Price, (g) Tsunamis, (h) Mackey-Glass.

The ability of our algorithm to deal with the exploration/exploitation tradeoff is emphasized by these experimental results, as its average regret decreases directly after the first iterations, avoiding unwanted exploration like GP-UCB in Figures 3(a) to 3(d), or getting stuck in some local optimum like EI in Figures 3(c), 3(g) and 3(h). We further mention that the GP-MI algorithm is empirically robust against the number of dimensions of the data set (Fig. 3(b), 3(g), 3(h)).

5.2. Practical aspects

Calibration of α. The value of the parameter α is chosen following Theorem 2 as α = log(2/δ), with 0 < δ < 1 being a confidence parameter. The guarantees we prove in Section 4.2 on the cumulative regret for the GP-MI algorithm hold with probability at least 1 − δ. With α increasing linearly as δ decreases exponentially toward 0, the algorithm is
robust to the choice of δ. We present in Figure 4 the small impact of δ on the average regret for four different values selected on a wide range.

Figure 4. Small impact of the value of δ (δ = 10^{−9}, 10^{−6}, 10^{−3}, 10^{−1}) on the mean average regret of the GP-MI algorithm running on the Himmelblau data set.

Numerical complexity. Even if the numerical cost of GP-MI is insignificant in practice compared to the cost of the evaluation of f, the complexity of the sequential Bayesian update (Osborne, 2010) is O(T²) and might be prohibitive for large T. One can reduce the computational time drastically by means of lazy variance calculation (Desautels et al., 2012), built on the fact that σ²_T(x) always decreases for increasing T and for all x ∈ X. We further mention that approximate inference algorithms such as the EP approximation and MCMC sampling (Kuss et al., 2005) can be used as an alternative if the computational time is a restrictive factor.

6. Conclusion

We introduced the GP-MI algorithm for GP optimization and proved upper bounds on its cumulative regret which improve exponentially on the state of the art in common settings. The theoretical analysis was presented in a generic framework in order to expand its impact to other similar algorithms. The experiments we performed on real and synthetic assessments confirmed empirically the efficiency of our algorithm against both the theoretical state of the art of GP optimization, the GP-UCB algorithm, and the commonly used EI heuristic.

Acknowledgements

The authors would like to thank David Buffoni and Raphaël Bonaque for fruitful discussions. The authors also thank the anonymous reviewers for their detailed feedback.

References

Audibert, J.-Y., Bubeck, S., and Munos, R. Bandit view on noisy optimization. In Optimization for Machine Learning. MIT Press, 2011.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Bercu, B. and Touati, A. Exponential inequalities for self-normalized martingales with applications. The Annals of Applied Probability, 18(5):1848-1869, 2008.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12, 2011.

Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R., and Auer, P. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Proceedings of the International Conference on Algorithmic Learning Theory. Springer-Verlag, 2011.

Chen, Y. and Krause, A. Near-optimal batch mode active learning and adaptive submodular optimization. In Proceedings of the International Conference on Machine Learning. icml.cc / Omnipress, 2013.

Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, volume 8188, pp. 225-240. Springer Berlin Heidelberg, 2013.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 1991.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355-366, 2008.

de Freitas, N., Smola, A. J., and Zoghi, M. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning. icml.cc / Omnipress, 2012.

Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning. icml.cc / Omnipress, 2012.

Dutykh, D., Poncet, R., and Dias, F. The VOLNA code for the numerical modelling of tsunami waves: generation, propagation and inundation. European Journal of Mechanics B/Fluids, 30(6):598-615, 2011.

Flake, G. W. and Lawrence, S. Efficient SVM regression training with SMO. Machine Learning, 46(1-3):271-290, 2002.

Floudas, C. A. and Pardalos, P. M. Optimization in Computational Chemistry and Molecular Biology: Local and Global Approaches. Nonconvex Optimization and Its Applications. Springer, 2000.

Grunewalder, S., Audibert, J.-Y., Opper, M., and Shawe-Taylor, J. Regret bounds for Gaussian process bandit problems. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010.

Hill, E. M., Borrero, J. C., Huang, Z., Qiu, Q., Banerjee, P., Natawidjaja, D. H., Elosegui, P., Fritz, H. M., Suwargadi, B. W., Prananto, I. R., Li, L., Macpherson, K. A., Skanavis, V., Synolakis, C. E., and Sieh, K. The 2010 Mw 7.8 Mentawai earthquake: Very shallow source of a rare tsunami earthquake determined from tsunami field survey and near-field GPS data. Journal of Geophysical Research, 117:B06402, 2012.

Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

Kleinberg, R. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17. MIT Press, 2004.

Krause, A. and Ong, C. S. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems 24, 2011.

Kuss, M., Pfingsten, T., Csató, L., and Rasmussen, C. E. Approximate inference for robust Gaussian process regression. Technical Report 136, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2005.

Mockus, J. Bayesian Approach to Global Optimization. Mathematics and its Applications. Kluwer Academic, 1989.

Osborne, M. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. PhD thesis, University of Oxford, 2010.

Rasmussen, C. E. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, 2012.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning, pp. 1015-1022. icml.cc / Omnipress, 2010.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265, 2012.

Stefanakis, T. S., Dias, F., Vayatis, N., and Guillas, S. Long-wave runup on a plane beach behind a conical island. In Proceedings of the World Conference on Earthquake Engineering, 2012.

Stefanakis, T. S., Contal, E., Vayatis, N., Dias, F., and Synolakis, C. E. Can small islands protect nearby coasts from tsunamis? An active experimental design approach. arXiv preprint, 2013.

Wang, G. and Shan, S. Review of metamodeling techniques in support of engineering design optimization. Journal of Mechanical Design, 129(4):370-380, 2007.

Ziemba, W. T. and Vickson, R. G. Stochastic Optimization Models in Finance. World Scientific, 2006.
More informationRepresentation Learning: A Review and New Perspectives
1 Representation Learning: A Review and New Perspectives Yoshua Bengio, Aaron Courville, and Pascal Vincent Department of computer science and operations research, U. Montreal also, Canadian Institute
More informationStatistical Inference in TwoStage Online Controlled Experiments with Treatment Selection and Validation
Statistical Inference in TwoStage Online Controlled Experiments with Treatment Selection and Validation Alex Deng Microsoft One Microsoft Way Redmond, WA 98052 alexdeng@microsoft.com Tianxi Li Department
More informationSteering User Behavior with Badges
Steering User Behavior with Badges Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell University Cornell University Stanford University ashton@cs.stanford.edu {dph,
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationEVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NONLINEAR REGRESSION. Carl Edward Rasmussen
EVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NONLINEAR REGRESSION Carl Edward Rasmussen A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate
More informationON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION. A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT
ON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT A numerical study of the distribution of spacings between zeros
More informationScalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
Seventh IEEE International Conference on Data Mining Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Robert M. Bell and Yehuda Koren AT&T Labs Research 180 Park
More informationLarge Scale Online Learning of Image Similarity Through Ranking
Journal of Machine Learning Research 11 (21) 1191135 Submitted 2/9; Revised 9/9; Published 3/1 Large Scale Online Learning of Image Similarity Through Ranking Gal Chechik Google 16 Amphitheatre Parkway
More informationGenerative or Discriminative? Getting the Best of Both Worlds
BAYESIAN STATISTICS 8, pp. 3 24. J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (Eds.) c Oxford University Press, 2007 Generative or Discriminative?
More informationFrom Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose. Abstract
To Appear in the IEEE Trans. on Pattern Analysis and Machine Intelligence From Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose Athinodoros S. Georghiades Peter
More informationOneyear reserve risk including a tail factor : closed formula and bootstrap approaches
Oneyear reserve risk including a tail factor : closed formula and bootstrap approaches Alexandre Boumezoued R&D Consultant Milliman Paris alexandre.boumezoued@milliman.com Yoboua Angoua NonLife Consultant
More informationHighRate Codes That Are Linear in Space and Time
1804 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 7, JULY 2002 HighRate Codes That Are Linear in Space and Time Babak Hassibi and Bertrand M Hochwald Abstract Multipleantenna systems that operate
More informationEfficient algorithms for online decision problems
Journal of Computer and System Sciences 71 (2005) 291 307 www.elsevier.com/locate/jcss Efficient algorithms for online decision problems Adam Kalai a,, Santosh Vempala b a Department of Computer Science,
More informationAn efficient reconciliation algorithm for social networks
An efficient reconciliation algorithm for social networks Nitish Korula Google Inc. 76 Ninth Ave, 4th Floor New York, NY nitish@google.com Silvio Lattanzi Google Inc. 76 Ninth Ave, 4th Floor New York,
More informationFast Solution of l 1 norm Minimization Problems When the Solution May be Sparse
Fast Solution of l 1 norm Minimization Problems When the Solution May be Sparse David L. Donoho and Yaakov Tsaig October 6 Abstract The minimum l 1 norm solution to an underdetermined system of linear
More informationLearning Deep Architectures for AI. Contents
Foundations and Trends R in Machine Learning Vol. 2, No. 1 (2009) 1 127 c 2009 Y. Bengio DOI: 10.1561/2200000006 Learning Deep Architectures for AI By Yoshua Bengio Contents 1 Introduction 2 1.1 How do
More informationCLoud Computing is the long dreamed vision of
1 Enabling Secure and Efficient Ranked Keyword Search over Outsourced Cloud Data Cong Wang, Student Member, IEEE, Ning Cao, Student Member, IEEE, Kui Ren, Senior Member, IEEE, Wenjing Lou, Senior Member,
More informationWhen Is Nearest Neighbor Meaningful?
When Is Nearest Neighbor Meaningful? Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft CS Dept., University of WisconsinMadison 1210 W. Dayton St., Madison, WI 53706 {beyer, jgoldst,
More informationMore Generality in Efficient Multiple Kernel Learning
Manik Varma manik@microsoft.com Microsoft Research India, Second Main Road, Sadashiv Nagar, Bangalore 560 080, India Bodla Rakesh Babu rakeshbabu@research.iiit.net CVIT, International Institute of Information
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289. Compressed Sensing. David L. Donoho, Member, IEEE
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289 Compressed Sensing David L. Donoho, Member, IEEE Abstract Suppose is an unknown vector in (a digital image or signal); we plan to
More information