Example-based Learning for Single-Image Super-resolution

Kwang In Kim (1) and Younghee Kwon (2)
(1) Max-Planck-Institut für biologische Kybernetik, Spemannstr. 38, D-72076 Tübingen, Germany
(2) Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong, Yusong-Ku, Taejon, Korea

Abstract. This paper proposes a regression-based method for single-image super-resolution. Kernel ridge regression (KRR) is used to estimate the high-frequency details of the underlying high-resolution image. A sparse solution of KRR is found by combining the ideas of kernel matching pursuit and gradient descent, which allows the time complexity to be kept at a moderate level. To resolve the problem of ringing artifacts that occur due to the regularization effect, the regression results are post-processed using a prior model of a generic image class. Experimental results demonstrate the effectiveness of the proposed method.

1 Introduction

Single-image super-resolution refers to the task of constructing a high-resolution enlargement of a given low-resolution image. This problem is inherently ill-posed, as there are generally multiple high-resolution images that can produce the same low-resolution image. Accordingly, prior information is required to approach this problem. Often, this prior information is available either in the explicit form of an energy functional defined on the image class [9, 10], or in the implicit form of example images leading to example-based super-resolution [1-3, 5].

Previous example-based super-resolution algorithms can be characterized as nearest neighbor (NN)-based estimations [1-3]: during the training phase, pairs of low-resolution and corresponding high-resolution image patches (sub-windows of images) are collected. Then, in the super-resolution phase, each patch of the given low-resolution image is compared to the stored low-resolution patches, and the high-resolution patch corresponding to the nearest low-resolution patch is selected as the output. For instance, Freeman et al. [2] posed image super-resolution as the problem of estimating missing high-frequency details by interpolating the input low-resolution image to the desired scale (which results in a blurred image). The super-resolution was then performed by NN-based estimation of high-frequency patches based on the corresponding patches of the input low-frequency image. Although this method (and also the other NN-based methods) has already shown impressive performance, there is still room for improvement if one views image super-resolution as a regression problem, i.e., finding a map f from the space of low-resolution image patches X to the space of target high-resolution patches Y.

It is well known in the machine learning community that NN-based estimation suffers from overfitting: one obtains a function that explains the training data perfectly yet does not generalize to unknown data. In super-resolution, this can result in noisy reconstructions in complex image regions (cf. Sect. 3). Accordingly, it is reasonable to expect that NN-based methods can be improved by adopting learning algorithms with a regularization capability that avoids overfitting.

Based on the framework of Freeman et al. [2], Kim et al. posed the problem of estimating the high-frequency details as a regression problem, which is then resolved by support vector regression (SVR) [6]. Meanwhile, Ni and Nguyen utilized SVR in the frequency domain and posed super-resolution as a kernel learning problem [7]. While SVR produced a significant improvement over existing example-based methods, it has several drawbacks for building a practical system:
1. As a regularization framework, SVR tends to smooth sharp edges and produces oscillations along the major edges. This may lead to a low reconstruction error on average, but it is visually implausible.
2. SVR yields a dense solution, i.e., the regression function is expanded over the whole set of training data points, and accordingly it is computationally demanding both in training and in testing.

The current work extends the framework of Kim et al. [6]. Kernel ridge regression (KRR) is utilized for the regression. Due to the observed optimality of ε at (nearly) 0 for SVR in our previous study (in our simulations, the optimum value of ε for the ε-insensitive loss function of SVR was close to zero), the only difference between SVR and KRR in the proposed setting is their loss functions (L_1- and L_2-loss, respectively). The L_2-loss adopted by KRR is differentiable and facilitates gradient-based optimization. To reduce the time complexity of KRR, a sparse basis is found by combining the ideas of kernel matching pursuit (KMP) [11] and gradient descent, such that the time complexity and the quality of super-resolution can be traded off. As the regularizer of KRR is the same as that of SVR, the problem of oscillations along the major edges remains. This is resolved by exploiting a prior over image structure proposed by Tappen et al. [9].

2 Regression-based Image Super-resolution

Base System. Adopting the framework of Freeman et al. [2], for the super-resolution of a given image we estimate the corresponding missing high-frequency details based on its interpolation to the desired scale, which in this work is obtained by bicubic interpolation. Furthermore, based on the conditional independence assumption of the high- and low-frequency components given the mid-frequency components of an image [2], the estimation of the high-frequency components (Y) is performed based on the Laplacian of the bicubic interpolation (X). Y is then added to the bicubic interpolation to produce the super-resolved image Z.
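As a concrete illustration of this base decomposition, the following minimal Python sketch computes the interpolated image and its Laplacian (the mid-frequency input X). SciPy's order-3 spline zoom is used here as a stand-in for bicubic interpolation, and the function name is purely illustrative, not from the paper.

    import numpy as np
    from scipy import ndimage

    def base_decomposition(low_res, scale=2):
        # Interpolate to the desired scale (order-3 spline, a stand-in
        # for the bicubic interpolation used in the paper).
        upsampled = ndimage.zoom(low_res.astype(float), scale, order=3)
        # Mid-frequency input X: the Laplacian of the interpolated image.
        X = ndimage.laplace(upsampled)
        return upsampled, X

    # The regressor estimates the high-frequency layer Y from patches of X;
    # the super-resolved image is then Z = upsampled + Y.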

To retain the complexity of the resulting regression problem at a moderate level, a patch-based approach is taken where the estimation of the values of Y at specific locations, N_N(Y(x, y)), is performed based only on the values of X at the corresponding locations, N_M(X(x, y)), where N_G(S(x, y)) represents a G-sized square window (patch) centered at the location (x, y) of the image S. During super-resolution, X is then scanned with a small window (of size M) to produce a patch-valued regression result (of size N) for each pixel. This results in a set of candidate pixels for each location of Z (as the patches overlap with their neighbors), which are then combined to make the final estimate (details will be provided later). The training images for the regressor are obtained by blurring and subsampling (by bicubic resampling) a set of high-resolution images to constitute a set of low- and high-resolution image pairs. The training image patch pairs are randomly sampled therein. To increase the efficiency of the training set, the data are contrast-normalized [2]: during the construction of the training set, both the input image patch and the corresponding desired patch are normalized by dividing them by the L_1-norm of the input patch. For an unseen image patch, the input is again normalized before the regression and the corresponding output is inverse-normalized.

For a given set of training data points {(x_1, y_1), ..., (x_l, y_l)} ⊂ R^M × R^N, we minimize the following regularized cost functional:

    O(\{f^1, \dots, f^N\}) = \sum_{i=1,\dots,N} \Big( \frac{1}{2} \sum_{j=1,\dots,l} \big( f^i(x_j) - y_j^i \big)^2 + \frac{1}{2} \lambda \| f^i \|_H^2 \Big),    (1)

where y_j = [y_j^1, ..., y_j^N] and H is a reproducing kernel Hilbert space (RKHS). Due to the reproducing property, the minimizer of the above functional is expanded in kernel functions:

    f^i(\cdot) = \sum_{j=1,\dots,l} a_j^i k(x_j, \cdot), \quad \text{for } i = 1, \dots, N,    (2)

where k is the generating kernel of H, which we choose as a Gaussian kernel (k(x, y) = exp(−‖x − y‖² / σ_k)). Equation (1) is the sum of individual convex cost functionals for each scalar-valued regressor and could be minimized separately. However, by tying the regularization parameter λ and the kernel k, we can reduce the time complexity of training and testing down to the case of scalar-valued regression, as the kernel matrix can then be shared: plugging (2) into (1) and noting the convexity of (1) yields

    A = (K + \lambda I)^{-1} Y,    (3)

where Y = [y_1, ..., y_l]^T and the i-th column of A constitutes the coefficient vector a^i = [a_1^i, ..., a_l^i]^T for the i-th regressor.

Sparse Solution. As evident from (2) and (3), the training and testing times of KRR are O(l^3) and O(M · l), respectively, which become prohibitive even for a relatively small number of training data points (e.g., l > 10,000).
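To make (1)-(3) concrete, here is a minimal NumPy sketch of the shared-kernel KRR fit; the function names and the data layout (rows of X are the M-dimensional input patches, rows of Y the N-dimensional target patches) are assumptions for illustration only.

    import numpy as np

    def gaussian_kernel(X1, X2, sigma_k):
        # Pairwise k(x, y) = exp(-||x - y||^2 / sigma_k).
        sq = (np.sum(X1 ** 2, axis=1)[:, None]
              + np.sum(X2 ** 2, axis=1)[None, :] - 2.0 * X1 @ X2.T)
        return np.exp(-np.maximum(sq, 0.0) / sigma_k)

    def fit_krr(X, Y, lam, sigma_k):
        # Closed-form coefficients A = (K + lambda I)^{-1} Y, eq. (3).
        # All N outputs share one kernel matrix, so a single solve suffices.
        K = gaussian_kernel(X, X, sigma_k)
        return np.linalg.solve(K + lam * np.eye(len(X)), Y)

    def predict_krr(X_train, A, X_new, sigma_k):
        # f^i(x) = sum_j a^i_j k(x_j, x), eq. (2), for all i at once.
        return gaussian_kernel(X_new, X_train, sigma_k) @ A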

One way of reducing the time complexity is to trade it off against the optimality of the solution by finding the minimizer of (1) only within the span of a basis set {k(b_1, ·), ..., k(b_{l_b}, ·)} (l_b ≪ l):

    f^i(\cdot) = \sum_{j=1,\dots,l_b} a_j^i k(b_j, \cdot), \quad \text{for } i = 1, \dots, N.    (4)

In this case, the solution is obtained by

    A = \big( K_{bx} K_{bx}^\top + \lambda K_{bb} \big)^{-1} K_{bx} Y,    (5)

where [K_{bx}]_{(i,j)} = k(b_i, x_j) (of size l_b × l) and [K_{bb}]_{(i,j)} = k(b_i, b_j) (of size l_b × l_b); accordingly, the testing time complexity reduces to O(M · l_b). For a given fixed set of basis points B = {b_1, ..., b_{l_b}}, the time complexity of computing the coefficient matrix A is O(l_b^3 + l · l_b · M); a code sketch of this fixed-basis solve is given at the end of this subsection. In general, the total training time depends on the method of finding B.

In KMP [11, 4], the basis points are selected from the training data points in an incremental way: given n − 1 basis points, the n-th basis point is chosen such that the cost functional (1) is minimized when A is optimized accordingly. An exact implementation of KMP costs O(l^2) time per step. Another possibility is to note the differentiability of the cost functional (1), which leads to gradient-based optimization for constructing B. Assuming that the evaluation of the derivative of k with respect to a basis vector takes O(M) time, the evaluation of the derivative of (1) with respect to B and the corresponding coefficient matrix A takes O(M · l · l_b + l · l_b^2) time. Because of the increased flexibility, gradient-based methods can in general achieve a better optimization of the cost functional (1) than selection methods, as already demonstrated in the context of sparse Gaussian process (GP) regression [8]. However, due to the non-convexity of (1) with respect to B, they are susceptible to local minima, and accordingly a good heuristic is required to initialize the solution.

In this paper, we use a combination of KMP and gradient descent. The basic idea is to assume that at the n-th step of KMP, the chosen basis point b_n plus the accumulation of basis points obtained until the (n − 1)-th step (B_{n−1}) is a good initial point. Then, at each step of KMP, B_n can subsequently be optimized by gradient descent. A naive implementation of this idea is still very expensive. To further reduce the complexity, the following simplifications are adopted:
1. In the KMP step, instead of evaluating the whole training set for choosing b_n, only l_c (l_c ≪ l) points are considered.
2. Gradient descent of B_n and the corresponding A_{(1:n,:)} (with a slight abuse of Matlab notation, A_{(m:n,:)} stands for the submatrix of A obtained by extracting the rows of A from m to n) is performed only at every r-th KMP step. Instead, at each KMP step, only b_n and A_{(n,:)} are optimized; in this case, the gradient can be evaluated at O(M · l) cost. Similarly to [4], A_{(n,:)} can be calculated analytically at O(M · l) cost:

    A_{(n,:)} = \frac{ K_{bx(n,:)} \big( Y - K_{bx(1:n-1,:)}^\top A_{(1:n-1,:)} \big) - \lambda K_{nb} A_{(1:n-1,:)} }{ K_{bx(n,:)} K_{bx(n,:)}^\top + \lambda }.    (6)
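For a fixed basis set, the reduced problem (4)-(5) is again a single linear solve. A sketch, reusing the gaussian_kernel helper from the previous snippet; the basis matrix B is assumed to hold one basis point per row, and the incremental KMP+gradient-descent selection of B is not shown.

    def fit_sparse_krr(X, Y, B, lam, sigma_k):
        # A = (K_bx K_bx^T + lambda K_bb)^{-1} K_bx Y, eq. (5).
        K_bx = gaussian_kernel(B, X, sigma_k)   # l_b x l
        K_bb = gaussian_kernel(B, B, sigma_k)   # l_b x l_b
        return np.linalg.solve(K_bx @ K_bx.T + lam * K_bb, K_bx @ Y)

    # Prediction now costs only O(M * l_b) per patch:
    #   y_new = gaussian_kernel(X_new, B, sigma_k) @ A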

At the n-th step, the l_c candidate basis points for KMP are selected based on a rather cheap criterion: we use the difference between the function output obtained at the (n − 1)-th step and the estimated desired response of full KRR for each training data point, which we in turn approximate by a localized KRR: for a training data point x_i, its NNs are collected from the training set and a full KRR is trained based on only these NNs. The output of this localized KRR for x_i gives the estimate of the desired response for x_i. It should be noted that these local KRRs cannot be applied directly for regression, as they might interpolate poorly on non-training data points. Once computed at the beginning, the estimated desired responses are fixed throughout the whole optimization process; a sketch of this estimation step follows Fig. 1 below.

To gain insight into the performance of the different sparse-solution methods, a set of preliminary experiments was performed with KMP, gradient descent (with the basis initialized by the k-means algorithm), and the proposed combination of KMP and gradient descent, using 10,000 training data points. Figure 1 summarizes the results. Both gradient-descent methods outperform KMP, while the combination with KMP provides a better performance still. This could be attributed to the better initialization of the solution for the subsequent gradient-descent step.

[Figure 1: plot of the cost functional (1) (y-axis, roughly 38-58) against the number of basis points (x-axis, 0-300) for KMP, gradient descent, and KMP+gradient descent.]
Fig. 1. Performance of the different sparse solution methods evaluated in terms of the cost functional (1). A fixed set of hyper-parameters was used such that the comparison can be made directly in terms of (1).
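The localized-KRR estimation of the desired responses, referenced above, can be sketched as follows, reusing fit_krr and predict_krr from the earlier snippet. The neighborhood size k_nn is an assumed parameter (the paper does not state the value used), and the naive O(l^2) neighbor search is for clarity only.

    def localized_krr_targets(X, Y, k_nn, lam, sigma_k):
        # For each training point x_i, fit a full KRR on its nearest
        # neighbors only and evaluate it at x_i; the results serve as the
        # fixed estimates of the desired responses.
        targets = np.empty(Y.shape, dtype=float)
        for i in range(len(X)):
            d = np.sum((X - X[i]) ** 2, axis=1)
            nn = np.argsort(d)[1:k_nn + 1]        # exclude x_i itself
            A = fit_krr(X[nn], Y[nn], lam, sigma_k)
            targets[i] = predict_krr(X[nn], A, X[i:i + 1], sigma_k)[0]
        return targets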

Combining Candidates. It is possible to construct a super-resolved image based on scalar-valued regression alone (i.e., N = 1). However, we propose to predict a patch-valued output, so that N different candidates are generated for each pixel. These candidates constitute a 3-D image Z whose third dimension indexes the candidates. This setting is motivated by two observations: 1. by sharing the hyper-parameters, the computational complexity of the resulting patch-valued learning reduces to that of scalar-valued learning; 2. the candidates contain information from different input image locations, which is actually diverse enough that their combination can boost performance: in our preliminary experiments, constructing an image by choosing the best and the worst (in terms of the distance to the ground truth) candidates from each 2-D location of Z resulted in an average signal-to-noise ratio (SNR) difference of 8.24 dB.

Certainly, the ground truth is not available at the actual super-resolution stage, and accordingly a way of constructing a single pixel out of the N candidates is required. One straightforward way is to construct the final estimate as a convex combination of the candidates based on a certain confidence measure. For instance, by noting that (sparse) KRR corresponds to maximum a posteriori estimation with a (sparse) GP prior [8], one could utilize the predictive variance as a basis for the selection. In our preliminary experiments this resulted in an improvement over the scalar-valued regression. However, a better prediction was obtained when the confidence estimate was based not only on the input patches but also on the context of neighboring reconstructions. For this, a set of linear regressors is trained such that for each location (x, y), they receive a patch of the output image Z_{(N_L(x,y),:)} and produce estimates of the differences ({d_1(x, y), ..., d_N(x, y)}) between the unknown desired output and each candidate. The final estimate of the pixel value at an image location (x, y) is then obtained as a convex combination of the candidates in the form of a softmax:

    Y(x, y) = \sum_{i=1,\dots,N} w_i(x, y) Z(x, y, i),    (7)

where

    w_i(x, y) = \exp\Big( -\frac{d_i(x, y)}{\sigma_C} \Big) \Big/ \sum_{j=1,\dots,N} \exp\Big( -\frac{d_j(x, y)}{\sigma_C} \Big).

For the experiments in this paper, we set M = 49 (7 × 7), N = 25 (5 × 5), L = 49 (7 × 7), σ_k = 0.025, σ_C = 0.03, and λ = 0.5 × 10^−7. These values were obtained based on a set of separate validation images. The number of basis points for KRR (l_b) was set to 300 as a trade-off between accuracy and time complexity. In the super-resolution experiments, the combination of candidates based on these parameters resulted in an average SNR increase of 0.43 dB over the scalar-valued regression.
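A vectorized sketch of the combination rule (7), assuming the candidate image Z and the estimated differences d_i from the linear confidence regressors are stored as H x W x N arrays:

    def combine_candidates(Z, D, sigma_c=0.03):
        # Softmax weights w_i = exp(-d_i/sigma_C) / sum_j exp(-d_j/sigma_C),
        # then the convex combination Y(x, y) = sum_i w_i(x, y) Z(x, y, i).
        W = np.exp(-D / sigma_c)
        W /= W.sum(axis=2, keepdims=True)
        return (W * Z).sum(axis=2)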

Post-processing Based on Image Prior. As demonstrated in Fig. 2b, the result of the proposed regression-based method is significantly better than bicubic interpolation. However, detailed visual inspection along the major edges (edges showing rapid and strong changes of pixel values) reveals ringing artifacts (oscillations occurring along the edges). In general, regularization methods (depending on the specific class of regularizer), including KRR and SVR, tend to fit the data with a smooth function. Accordingly, at sharp changes of the function (edges in the case of images), oscillations occur to compensate for the resulting loss of smoothness. While this problem could be resolved indirectly by imposing less regularization in the vicinity of edges, a more direct approach is to rely on prior knowledge of the discontinuities of images. In this work, we use a modification of the natural image prior (NIP) framework proposed by Tappen et al. [9]:

    P(\{x\} \mid \{y\}) = \frac{1}{C} \prod_{(i,\, j \in N_S(i))} \exp\Big[ -\Big( \frac{|\hat{x}_i - \hat{x}_j|}{\sigma_N} \Big)^{\alpha} \Big] \prod_i \exp\Big[ -\Big( \frac{\hat{x}_i - y_i}{\sigma_R} \Big)^2 \Big],    (8)

where {y} represents the observed variables corresponding to the pixel values of Y, {x} represents the latent variables, and N_S(i) stands for the 8-connected neighbors of the pixel location i. While the second product term has the role of preventing the final solution from drifting far away from the input regression result Y, the first product term tends to smooth the image based on the costs |x̂_i − x̂_j|. The role of α (< 1) is to re-weight the costs such that the largest difference is stressed relatively less than the others, so that large changes of pixel values are penalized relatively less. Furthermore, the cost term |x̂_i − x̂_j|^α becomes piecewise concave with extreme points at N_S(i), so that if the second term were removed, the maximum probability for a pixel i would be achieved by assigning it the value of one of its neighbors, rather than some weighted average of the neighbors, which might have been the case for α > 1. Accordingly, this distribution prefers a single strong edge over a set of small edges and can be used to resolve the problem of smoothing around major edges. The optimization of (8) is performed by belief propagation (BP), similarly to [9]. To facilitate the optimization, we reuse the candidate set generated in the regression step, so that the best candidates are chosen by BP.

Fig. 2. Example of super-resolution: a. bicubic, b. regression result, c. post-processed result of b based on NIP, d. Laplacian of bicubic with major edges displayed as green pixels, and e and f. enlarged portions of a-c, from left to right.

Optimizing (8) throughout the whole image region can lead to degraded results, as it tends to flatten textured areas, especially when the contrast is low so that the contribution of the second term is small (in the original work of Tappen et al. [9] this problem does not occur, as the candidates are 2 × 2 image patches rather than individual pixels). This problem is resolved by applying the (modified) NIP only in the vicinity of major edges. Based on the observation that the input images are blurred, so that very high spatial-frequency components have been removed, the major edges are found by thresholding each pixel of the Laplacian of the input image using the L_2 and L_∞ norms of the local patches encompassing it. It should be noted that a major edge is in general different from an object contour. For instance, in Fig. 2d, the boundary between the chest of the duck and the water is not detected as a major edge, as the intensity variations across the boundary are not significant; in this case, no visible oscillation of pixel values is observed in the original regression result.
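The BP optimization itself is not spelled out in the paper, but the objective it optimizes is fully specified by (8). As an illustration, the following sketch evaluates the corresponding negative log-probability (up to the constant C) of a candidate image x against the regression output y, counting each 8-connected pair once; this is the quantity BP would minimize over the candidate set.

    def nip_energy(x, y, alpha=0.85, sigma_n=200.0, sigma_r=1.0):
        # Negative log of eq. (8) up to log C: the smoothness term over the
        # 8-neighborhood plus the fidelity term to the regression result y.
        e = np.sum(((x - y) / sigma_r) ** 2)
        h, w = x.shape
        for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):
            a = x[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
            b = x[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
            e += np.sum((np.abs(a - b) / sigma_n) ** alpha)
        return e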

The parameters α, σ_N, and σ_R are set to 0.85, 200, and 1, respectively. While the improvement in terms of SNR is less significant (on average 0.04 dB over the combined regression result), the improved visual quality at major edges demonstrates the effectiveness of the NIP (Fig. 2).

3 Experiments

The proposed method was evaluated on a set of high- and low-resolution image pairs (Fig. 3) that is disjoint from the training images. The desired resolution is twice that of the input image along each dimension. The number of training data points is 200,000, and it took around a day to train the sparse KRR on a 2.5 GHz PC. For comparison, several different example-based image super-resolution methods were evaluated: Freeman et al.'s NN-based method [2], Tappen et al.'s NIP [9], and Kim et al.'s SVR-based method [6] (trained on only 10,000 data points). (The original NIP algorithm was developed for super-resolving NN-subsampled images, not the bicubic resampling used in the experiments with all the other methods. Accordingly, for the experiments with NIP, the low-resolution images were generated by NN subsampling. The visual quality of the resulting super-resolved images is not significantly different from that obtained with bicubic resampling; however, the quantitative results should not be compared directly with the other methods.)

Fig. 3. Thumbnails of the test images; the images are indexed by numbers arranged in raster order.

Figure 4 shows examples of super-resolution results. All the example-based super-resolution methods outperform bicubic interpolation in terms of visual plausibility. The NN-based method and the original NIP produced sharper images at the expense of introducing noise, which, even with the improved visual quality, leads to lower SNR values than bicubic interpolation. The SVR produced less noisy images; however, it generated smoothed edges and perceptually distracting ringing artifacts, which have disappeared for the proposed method. Disregarding the post-processing stage, we measured on average a 0.69 dB improvement in SNR for the proposed method over the SVR. This can be attributed to the sparsity of the solution, which enabled training on a large data set, and to the effectiveness of the candidate-combination scheme. Moreover, in comparison to SVR, the proposed method requires much less processing time: super-resolving a 256 × 256 image to 512 × 512 takes around 25 seconds with the proposed method and 20 minutes with the SVR-based method. For quantitative comparison, the SNRs of the different algorithms are plotted in Fig. 5.
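The SNR figures quoted in this section can be reproduced with a standard definition of signal-to-noise ratio; the paper does not spell out its exact formula, so the following is an assumption.

    def snr_db(reference, estimate):
        # Ratio of signal power to reconstruction-error power, in dB.
        ref = reference.astype(float)
        err = ref - estimate.astype(float)
        return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))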

Fig. 4. Results of the different super-resolution algorithms on two images from Fig. 3: a-b. original, c-d. bicubic, e-f. SVR [6], g-h. NN-based method [2], i-j. NIP [9], and k-l. proposed method.

4 Conclusion

This paper approached the problem of image super-resolution from a nonlinear regression viewpoint. A combination of KMP and gradient descent was adopted to obtain a sparse KRR solution, which enables a realistic application of regression-based super-resolution. To resolve the problem of smoothing artifacts that occur due to the regularization, the NIP was adopted to post-process the regression result, such that the edges are sharpened while the artifacts are suppressed. Comparison with existing example-based image super-resolution methods demonstrated the effectiveness of the proposed method. Future work should include comparison with, and combination of, various non-example-based approaches.

[Figure 5: increase of SNR over bicubic (y-axis, dB) against image index 1-12 (x-axis) for bicubic, SVR, NN, NIP, and the proposed method.]
Fig. 5. Performance of the different super-resolution algorithms.

Acknowledgment. The contents of this paper have greatly benefited from discussions with G. Bakır and C. Walder, and from comments by the anonymous reviewers. The idea of using localized KRR originated with C. Walder.

References

1. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Pattern Analysis and Machine Intelligence 24(9), 1167-1183 (2002)
2. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Computer Graphics and Applications 22(2), 56-65 (2002)
3. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Computer Graphics (Proc. Siggraph 2001), pp. 327-340. ACM Press, NY (2001)
4. Keerthi, S.S., Chu, W.: A matching pursuit approach to sparse Gaussian process regression. In: Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA (2005)
5. Kim, K.I., Franz, M.O., Schölkopf, B.: Iterative kernel principal component analysis for image modeling. IEEE Trans. Pattern Analysis and Machine Intelligence 27(9), 1351-1366 (2005)
6. Kim, K.I., Kim, D.H., Kim, J.H.: Example-based learning for image super-resolution. In: Proc. the Third Tsinghua-KAIST Joint Workshop on Pattern Recognition, pp. 140-148 (2004)
7. Ni, K., Nguyen, T.Q.: Image superresolution using support vector regression. IEEE Trans. Image Processing 16(6), 1596-1610 (2007)
8. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA (2006)
9. Tappen, M.F., Russel, B.C., Freeman, W.T.: Exploiting the sparse derivative prior for super-resolution and image demosaicing. In: Proc. IEEE Workshop on Statistical and Computational Theories of Vision (2003)
10. Tschumperlé, D., Deriche, R.: Vector-valued image regularization with PDEs: a common framework for different applications. IEEE Trans. Pattern Analysis and Machine Intelligence 27(4), 506-517 (2005)
11. Vincent, P., Bengio, Y.: Kernel matching pursuit. Machine Learning 48, 165-187 (2002)