A Learning Based Method for Super-Resolution of Low Resolution Images


Emre Ugur
June 1, 2004

Abstract

The main objective of this project is the study of a learning-based method for super-resolving low-resolution images. A domain-specific prior is incorporated into super resolution by means of learning-based estimation of the missing details. Images are decomposed into fixed-size patches to keep time and space complexity manageable. The problem is modeled by a Markov Random Field (MRF), which enforces spatial consistency in the resulting images. The spatial interactions are coupled with a similarity constraint established between high-resolution training image patches and low-resolution observations.

1 Introduction

The main objective of this project is the study of a learning-based method for super-resolving low-resolution images. In super-resolution methods, given a low-resolution image or image sequence (a video), one tries to estimate a highly zoomed image or video and to recover the missing image details. Super resolution is an ill-posed inverse problem. It is an inverse problem because the low-resolution observation is the result of a smoothing and downsampling process. It is ill-posed because a large amount of information is lost: many different high-resolution images yield the same low-resolution image when they are smoothed and downsampled.

In the past two decades, the problem of estimating fine details from low-resolution images has attracted a number of researchers, whose approaches can be grouped into two main classes. The first and more common class consists of Reconstruction-Based (RB) methods [5], which try to increase the effective sampling density. These methods attempt to solve the problem by employing and fusing a number of low-resolution images.
The images of an underlying scene are registered into a common coordinate frame using their sub-pixel shifts, and a coherent high-resolution image is obtained on a finer grid. The approach employed in this report belongs to a relatively recent family of solutions based on learning-based techniques. A training set composed of a number of high-resolution images or videos is used to later predict the details of lower-resolution images or videos. Both classes employ additional priors, implicitly or explicitly, to encourage generic image properties such as local smoothness. Whereas RB methods require multiple low-resolution observations of the same scene to estimate the high-resolution counterpart, learning-based methods propose that a single low-resolution image contains adequate information to predict the details and super-resolve the image.

In [1], super resolution is defined as a process whose goal is to increase the resolution while adding appropriate high-frequency information. To this end, [1] employs a large database of patch pairs, each storing a rectangular patch of the high-frequency component of a high-resolution image and the corresponding smoothed and downsampled low-resolution counterpart. The relationship between the middle and high frequencies of natural images is captured and used to super-resolve low-resolution static images and movies. Although a zoom factor of x4 is achieved for static images, direct application of the approach to video is not successful.

In [2], spatio-temporal consistencies and image formation/degradation processes are employed to estimate, or hallucinate, high-resolution video. Since learning-based approaches are more powerful when their application is limited to a specific domain, a database of facial expressions is obtained from a sequence (video) of high-resolution images. The low-resolution counterparts are acquired using a local smoothing and downsampling process. Images are divided into patches, and spatial (within a single image) and temporal (across time) consistencies are established among these patches. After the training database is constructed, the authors try to find the Maximum A Posteriori (MAP) high-resolution image and illumination offset by first finding a template image. The template image is constructed from high-resolution patches in the database so as to maximize a set of constraints. Since this project is based on the approach of [2], the details of the model and algorithm are described in the following sections.
1.1 Differences from original work

The assumptions employed in this study differ from [2] in several respects. First, we assume that the illumination conditions are the same or similar in the high-resolution training data and the low-resolution observations. The problem is thus reduced to estimating only the high-resolution image, which simplifies the solution. In particular, the hallucination step of [2], in which a gradient descent algorithm is applied to solve a quadratic minimization problem, is dropped. This reduction does not diminish the relevance of our solution, since the most important part of the learning-based method is finding a feasible template image.

Another difference appears in the registration of training and test data. In [2], each image is adjusted so that the outline of the face lies at the same coordinates throughout the database, which avoids patch compositions that are unrealistic in a translational or rotational sense. In our experiments, we spend no additional effort on adjusting the face location and expect the Markovian model to avoid such effects by forcing similar patches to become neighbors.

The experiments in [2] are mostly based on video and make good use of the temporal prior, that is, the relation between consecutive images. Moreover, they try to recover realistic facial expressions from low-resolution video in which facial expressions change very slowly. In our experiments, the criterion of success is mostly the difference between the estimate and the real high-resolution image underlying the low-resolution observation. In order to register a large number of different facial expressions and head placements in the training set, our consecutive images are used as static images, and no temporal relation between images is established. Thus, experiments are performed only for super resolution of static images.

The rest of this report is organized as follows. In Section 2, the framework on which the learning-based super-resolution algorithm is based is described. In Section 3, a sequence of 75 images captured at 192x304 resolution is used to construct the facial expressions database, and a number of artificially downsampled low-resolution images are used to test the system, super-resolving with a zoom factor of x16. Additionally, some parameters of the algorithm are tuned and the model is modified in order to understand the behavior of the system on various low-resolution test images. In Section 4, the conclusions of the study and future work are presented.

2 Framework

The main motivation behind this graphical model is the complementary use of low-resolution image data, via a similarity measure, and of relations constructed among the training image patches, via a Markov Random Field. In the very first step, a database is constructed from a number of (in our case a sequence of) high-resolution images. Building a database that contains all possible facial expressions and pixel combinations is a very challenging task because of its enormous size and time requirements. Therefore, local models, defined over image patches and considered independent of each other, can be employed to describe the whole image scene. However, unrealistic face compositions may be created if no relation is defined among these local models. A Markov Random Field is therefore employed to model the spatial relations among patches. The details of the statistical relationship established among patches are described later.

2.1 Registration of training set

The training database is composed of entries that capture different properties of the underlying image patch.
More specifically, each entry contains (i) the corresponding high-resolution image patch, (ii) the location of the patch in the sample image, (iii) the neighboring pixels (used to establish realistic estimations among patches), and (iv) a feature vector [3] that stores different properties of the patch's low-resolution counterpart. Figure 1 shows the contents of an entry (the figure and its explanation are taken from [2]). The size of the high-resolution patch depends on the zoom factor; for a zoom factor of x16, it is defined over a 16x16 grid. The neighboring pixels are used to compute the statistical local relation between neighboring patches in the MRF. The width of the neighboring pixel band may vary according to the prior of the application domain. To achieve a solution that is more consistent in terms of smoothness across neighboring patches, it is desirable to store a wider band of neighboring pixels in each entry.

The feature vector, or parent vector, was first defined in [3] to store different properties of an image region in a hierarchical manner. More specifically, for any pixel, intensity, Gaussian, Laplacian, and horizontal and vertical derivative pyramids are formed at multiple scales. The feature vector is the combination of these pyramids, which provide information about both local features and the general underlying image scene. In later steps, the feature vector is used to compute the similarity between low-resolution image patches from the training and test sets.

Figure 1: Database is composed of entries which are generated from image patches in the training set. Each entry contains the high resolution image patch, neighboring pixels (to be used in MRF interactions), a feature vector of low resolution counterpart, and the location in x-y coordinates.

2.2 Estimation process

The estimation process is based on finding a unique template from which the high-resolution image is extracted. In the original work, as mentioned in Section 1.1, the high-resolution image and an intensity offset were the two unknowns of the problem. In this project, we reduce the search space and search only for the high-resolution image, assuming there is no intensity offset (i.e., training and test images are captured under similar conditions). The super-resolution problem therefore maps to finding the Maximum A Posteriori (MAP) high-resolution image $H_{MAP}$:

$$H_{MAP} = \arg\max_H \log P(H \mid L)$$

where $L$ is the low-resolution observation. The posterior is obtained by marginalizing over the unknown template image $T$, which is composed of image patches from the training database:

$$P(H \mid L) = \sum_T P(H, T \mid L)$$

Applying the chain rule gives

$$P(H \mid L) = \sum_T P(H \mid T, L)\, P(T \mid L)$$

and applying Bayes' rule to the first factor gives

$$P(H \mid L) = \sum_T \left\{ \frac{P(L \mid H, T)\, P(H \mid T)}{P(L \mid T)} \right\} P(T \mid L)$$

Since only the Markov Random Field model relates different nodes, there is no direct conditioning between two nodes that are not linked in the MRF. As a result, $P(L \mid H, T) = P(L \mid H)$. Dropping the denominator, which does not depend on $H$ and becomes a constant once the sum is approximated by a single template, the quantity to be maximized over $H$ is

$$P(H \mid L) \propto \sum_T P(L \mid H)\, P(H \mid T)\, P(T \mid L)$$

Since $P(T \mid L)$ has its maximum around the true high-resolution solution, we compute a unique (peak) template $T^*$ that maximizes this posterior, using the low-resolution image and the database entries. If $P(T \mid L)$ is approximated as being highly concentrated around $T = T^*(L)$, the original posterior to be maximized, $P(H \mid L)$, becomes

$$P(H \mid L) \approx P(L \mid H)\, P(H \mid T^*)$$

Therefore, the Maximum A Posteriori high-resolution image is computed by

$$H_{MAP} = \arg\max_H \left\{ \log P(L \mid H) + \log P(H \mid T^*) \right\}$$

In Section 2.3, the computation of the peak template $T^*$ is presented. Within the scope of this project, no further estimation step is taken by means of any hallucination technique. Interested readers can refer to Baker and Kanade [4] and Dedeoglu et al. [2] for the minimization processes employed to satisfy certain criteria.

2.3 Formation of unique template

The previous section described that, in order to maximize $P(T \mid L)$, a unique template should be constructed from the patches in the database. Since the nodes in $L$ are conditionally independent given the template, Bayes' rule can be applied in the maximization of $P(T \mid L)$:

$$P(T \mid L) \propto P(L \mid T)\, P(T) = \prod_{p=1}^{N} P(L_p \mid T_p)\, P(T) \qquad (1)$$

where $N$ is the number of patches. The peak template is computed according to this formulation. Maximizing the first term on the right-hand side minimizes the difference between the low-resolution observation and the downsampled unknown template. The second term favors a consistent template, in which spatial consistency is established by means of MRF modeling.

2.3.1 Feature vector and maximizing similarity

As defined in Section 2.1, the feature vector is the means of storing low-resolution patch properties. For training images, high-resolution images are smoothed and downsampled to obtain low-resolution feature images.
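The smoothing and downsampling step can be sketched as follows. This is a minimal illustration and not the operator used in the experiments: it assumes the smoothing is a plain box average over each 16x16 block, and the function name `smooth_and_downsample` and the random stand-in image are hypothetical. The last lines also illustrate why the inverse problem is ill-posed: adding a pattern with zero mean in every block changes the high-resolution image but not the low-resolution observation.

```python
import numpy as np

def smooth_and_downsample(high_res: np.ndarray, factor: int = 16) -> np.ndarray:
    """Average each factor x factor block (box smoothing followed by decimation)."""
    rows, cols = high_res.shape
    if rows % factor or cols % factor:
        raise ValueError("image size must be a multiple of the zoom factor")
    # Split the image into blocks and average over the two within-block axes.
    blocks = high_res.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
high_res = rng.random((304, 192))  # stand-in for one 192x304 frame (rows x cols)
low_res = smooth_and_downsample(high_res)
print(low_res.shape)  # (19, 12)

# Ill-posedness: a checkerboard of +/-0.5 has zero mean in every 16x16 block,
# so two different high-resolution images give the same observation.
checker = np.indices(high_res.shape).sum(axis=0) % 2 - 0.5
print(np.allclose(smooth_and_downsample(high_res + checker), low_res))  # True
```

A Gaussian or other local smoothing kernel could be substituted for the box average without changing the rest of the pipeline.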
Each entry in the database is a quadruple $(t_k, \eta_k, f_k, s_k)$, where $t_k$ is the high-resolution template patch, $\eta_k$ is the set of neighboring pixels, $f_k$ is the feature vector computed at the corresponding low-resolution pixel, and $s_k$ is the location of the patch in x-y coordinates. In Equation 1, with the low-resolution observation represented by its feature vectors $F$, the term $P(F \mid T) = \prod_{p=1}^{N} P(F_p \mid T_p)$ is maximized by computing the similarity between the feature vectors of the template patches and the feature vector of the low-resolution observation's patch. The patches must be at the same location in order to be comparable. In other words, for each entry $k$, a similarity measure $P(F_p = f_p \mid T_p = t_k)$ is computed at location $s_k$, where $f_p$ is the feature vector of the low-resolution observation at location $p = s_k$:

$$P(F_p = f_p \mid T_p = t_k) \propto \begin{cases} \exp\left(-\lVert f_p - f_k \rVert\right) & \text{if } s_k = p \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
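The similarity measure of Equation 2 can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the dictionary keys `"f"` and `"s"`, the helper names `similarity` and `best_entry`, and the three-component toy feature vectors are hypothetical, and the returned value is unnormalized.

```python
import numpy as np

def similarity(f_p, f_k, p, s_k):
    """Unnormalized P(F_p = f_p | T_p = t_k) following Equation 2."""
    if tuple(s_k) != tuple(p):
        return 0.0  # entries stored for another location are not comparable
    diff = np.asarray(f_p, dtype=float) - np.asarray(f_k, dtype=float)
    return float(np.exp(-np.linalg.norm(diff)))  # Euclidean distance

def best_entry(f_p, p, database):
    """Index of the database entry with the highest similarity at location p."""
    scores = [similarity(f_p, entry["f"], p, entry["s"]) for entry in database]
    return int(np.argmax(scores))

database = [
    {"f": np.array([0.2, 0.1, 0.0]), "s": (3, 5)},
    {"f": np.array([0.9, 0.4, 0.3]), "s": (3, 5)},
    {"f": np.array([0.2, 0.1, 0.0]), "s": (4, 5)},  # same features, other location
]
f_obs = np.array([0.25, 0.1, 0.0])
print(best_entry(f_obs, (3, 5), database))  # 0
```

Picking the best entry independently at every location corresponds to using the similarity term alone, without any spatial consistency between neighboring patches.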

2.3.2 Spatial consistency and relationship among nodes in MRF

For each overlapping high-resolution patch pair in the training database, a likelihood $P(T_p, T_q)$ is computed to force the patches to be spatially consistent. Patches from different images or from the same image are merged, preserving their location relative to the whole scene, to form template patch configurations. Each template patch configuration is assigned a probability that expresses the total spatial consistency of the configuration. For example, if a template is composed of patches from the same image, the probability is maximal. Conversely, if neighboring patches of the template do not join smoothly, the probability is low. The probability of spatial consistency of a template $T$ is the product over all neighboring pairs in the configuration, $P(T) = \prod_{(T_p, T_q)} \psi(T_p, T_q)$. The compatibility function $\psi(T_p, T_q)$, written here for $T_p = t_k$ and $T_q = t_l$, uses the difference between intensity values in the overlapping regions:

$$\psi(T_p, T_q) \propto \exp\left( -\sum_{\text{overlap}} \big(t_k(u) - \eta_l(v)\big)^2 - \sum_{\text{overlap}} \big(\eta_k(u) - t_l(v)\big)^2 \right) \qquad (3)$$

2.3.3 Computation of T

Using Equations 2 and 3, the peak template $T^*$ is computed by maximizing

$$T^* = \arg\max_T \prod_{p=1}^{N} P(F_p \mid T_p) \prod_{(T_p, T_q)} \psi(T_p, T_q) \qquad (4)$$

The computational cost of searching for the optimum solution is huge, so the maximum of the above expression is computed by an approximate method. The Iterated Conditional Modes (ICM) algorithm first constructs a feasible initial template and then repeatedly revisits the patches, replacing each with the database patch that maximizes the conditional probability defined over the MRF given its current neighbors. The algorithm is taken from [2].

Algorithm 1: ICM algorithm for $T^*$
  for all image patches $p$ do
    $T_p \leftarrow \arg\max_{t_k} P(F_p = f_p \mid T_p = t_k)$
  end for
  while not converged do
    $T_p \leftarrow \arg\max_{t_k} P(F_p = f_p \mid T_p = t_k) \prod_{q \in \text{Neighbors}(p)} \psi(T_p = t_k, T_q)$
  end while

3 Experiments

3.1 Experiment parameters

The database of facial expressions is generated from a 150-frame video of a person who continuously and abruptly changes his expressions. Unlike in [2], no additional work is done to adjust the head to a common coordinate frame; however, the subject tried not to move his head during capture. The resolution of the training images is 192x304. Although the images are captured as video, the frames are treated as static images, and no temporal relation among images is incorporated. To construct the database, the high-resolution images are smoothed with a local smoothing operator, and then a simple downsampling process is applied to obtain the low-resolution (12x19) counterparts.

The learning-based method employed in this project rests on two main concepts: maximizing similarity and spatial consistency. The similarity measure is computed from the difference between the feature vectors of low-resolution image patches. Euclidean distance is used as the distance between feature vectors, each of which is composed of intensity, Laplacian, and horizontal and vertical derivative components. The effect of each component is examined later in this section.

The other important objective is to establish spatial relationships between image patches. As described in detail in Section 2.3.2, the Markov Random Field uses neighborhood relations to compute the likelihood of any template configuration. The width of the neighboring pixel band is a very important parameter in determining the relative effect of spatial consistency on the solution. In our study, we use a width of 1 pixel.

The full length of our video is 250 frames. The last 50 frames of the same video are used as test images. Test images are extracted from the same video because the illumination conditions should be similar and the head should be at similar coordinates in the underlying scene. It is almost impossible for any training image and test image to be the same, because the subject changes his facial expressions unrealistically and there is a 50-frame gap between the two sets.
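The ICM procedure of Algorithm 1, with a neighborhood width of 1 pixel, can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: it works in the log domain, compares the abutting border pixels of adjacent candidate patches instead of the stored neighboring pixels $\eta$ of Equation 3, and uses a 4-connected grid. The names `log_psi`, `icm`, `max_sweeps`, and the toy data are hypothetical.

```python
import numpy as np

def log_psi(a, b, direction):
    """Log-compatibility of patch a with its neighbor b: the negative squared
    difference along the shared border. direction is the (row, col) offset
    from a to b."""
    if direction == (0, 1):       # b lies to the right of a
        d = a[:, -1] - b[:, 0]
    elif direction == (0, -1):    # b lies to the left of a
        d = a[:, 0] - b[:, -1]
    elif direction == (1, 0):     # b lies below a
        d = a[-1, :] - b[0, :]
    else:                         # b lies above a
        d = a[0, :] - b[-1, :]
    return -float(np.sum(d ** 2))

def icm(log_sim, patches, max_sweeps=20):
    """log_sim[r, c, k]: log-similarity of candidate k at grid cell (r, c).
    patches[r, c, k]: that candidate's high-resolution patch (2-D array).
    Returns the chosen candidate index for every cell."""
    rows, cols, n_cand = log_sim.shape
    labels = log_sim.argmax(axis=2)   # initial template: similarity only
    offsets = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                scores = log_sim[r, c].copy()
                for dr, dc in offsets:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbor = patches[rr, cc, labels[rr, cc]]
                        for k in range(n_cand):
                            scores[k] += log_psi(patches[r, c, k], neighbor, (dr, dc))
                best = int(scores.argmax())
                if best != labels[r, c]:
                    labels[r, c] = best
                    changed = True
        if not changed:               # no patch changed in a full sweep
            break
    return labels

# Toy example: two cells side by side, each with a dark and a bright candidate.
dark, bright = np.zeros((2, 2)), np.ones((2, 2))
patches = np.array([[[dark, bright], [dark, bright]]])
log_sim = np.array([[[0.0, -5.0], [-0.1, 0.0]]])
print(log_sim.argmax(axis=2))  # [[0 1]]
print(icm(log_sim, patches))   # [[0 0]]
```

In the toy example, similarity alone selects a dark patch next to a bright one; once the border term is included, the weakly preferred bright patch is replaced so that the two neighbors agree.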
3.2 Results

As demonstrated in Figure 2, the learning-based super-resolution method gives feasible solutions when a zoom factor of x16 is applied. The images in Figure 2(a) are samples from the high-resolution test image set, selected to show results for different facial expressions. The high-resolution images are smoothed and downsampled to obtain the low-resolution observations that serve as inputs to the experiments (Figure 2(b)). If only feature vector similarity is used to maximize the likelihood of a template configuration, spatially inconsistent solutions are obtained, as shown in Figure 2(c). Since the patches in the template behave independently, the scene appears to be composed of image blocks that do not agree in their overlapping areas. Figure 2(d) shows estimated high-resolution images that look better and smoother, since spatial interactions are enforced via Markov Random Field modeling. More specifically, Figure 2(c) is the result of the first step of the ICM algorithm (Section 2.3.3), whereas in Figure 2(d) the algorithm has run to completion. In Figure 3, enlarged versions of the first images of Figures 2(c) and 2(d) show the effect of incorporating spatial interactions more clearly.

Since the experiments produce relatively smooth images, one important question arises: are the template images composed of patches that belong to only a small number of database images? Figure 4 demonstrates that, although each template has a small number of dominant images in the database, in general most of the images contribute to the generated template. Figure 4(a) shows the number of patches from each image that are inserted into the template. In Figure 4(b), patches are painted with different gray levels according to the database image they are taken from. This figure clearly shows that different regions (or patches) of the template face are taken from different images in the database.

After the main results were obtained, further experiments were performed to understand the effect of the different components stored in the feature vector. Recall that our feature vector is composed of intensity, Laplacian, and horizontal and vertical derivative components. Figure 5(b) shows the resulting template when the Laplacian component is removed, Figure 5(c) is the result of omitting the derivative components, and in Figure 5(d) only the intensity component is used. The figures show that the horizontal and vertical derivative components are essential in the feature vector, but there is little difference between the results with and without the Laplacian component.

4 Discussion and Future Work

In this project, the super-resolution problem is examined for a domain-specific instance, and feasible results are obtained. The power of learning-based methods in domain-limited applications is exploited to achieve a zoom factor of x16. Patches from a training set of static images are registered into a database that stores their local properties and relationships. A Markov Random Field is used to model the patch relations, so that only local interactions are involved. Patches from different images are fused in different combinations, and a probability is computed for each template from the neighborhood relations, that is, the overlap similarities, of each patch pair in the template configuration. The objective therefore becomes to jointly maximize the probability of the template configuration and the similarity of the template to the low-resolution observation. The similarity of a template to the low-resolution observation is computed via the Euclidean distance between the feature vectors of the image patches. The results are feasible but far from perfect.
In several figures, one can easily see that image patches do not fit well with their neighbors, even though spatial interaction is enforced. There is a trade-off between enforcing spatial interaction and remaining similar to the observation. One can increase the width of the neighboring pixel band, which is 1 in our project, to avoid patchy images. Another solution may be the use of variable-size patches. Although the increase in time and space complexity would be enormous, smoother images could be obtained while preserving similarity to the low-resolution observation.

References

[1] C. M. Bishop, A. Blake, and B. Marthi. Super-resolution enhancement of video. In C. M. Bishop and B. Frey, editors, Proceedings Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics.
[2] G. Dedeoglu, T. Kanade, and J. August. High-zoom video hallucination by exploiting spatio-temporal regularities. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 04), June.
[3] DeBonet and P. A. Viola. A non-parametric multi-scale statistical model for natural images. In Advances in Neural Information Processing Systems (NIPS), volume 10. The MIT Press.
[4] S. Baker and T. Kanade. Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9).
[5] A. Papoulis. Generalized sampling theorem. IEEE Transactions on Circuits and Systems, vol. 24, November.

Figure 2: (a) High-resolution test images at 192x304. (b) Low-resolution observations, downsampled from (a), at 12x19. (c) Template high-resolution estimate without spatial interaction.

Figure 3: (a) Enlarged view of Figure 2(c), without spatial interaction. (b) Enlarged view of Figure 2(d), with spatial interaction.

Figure 4: Although there exist dominant images, templates tend to take patches from almost all images. (a) Number of images in the template configuration for Figure 2 (four plots; axes: Image Index in Training Set, Number of times images appeared in the template). (b) Different gray levels assigned to different image patches for Figure 2.

Figure 5: Feature vector is modified. (a) No modification. (b) No Laplacian component. (c) No vertical or horizontal derivative component. (d) Only intensity component.