
Systems and Computers in Japan, Vol. 38, No. 5, 2007
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 1, January 2006.

Low-Delay Multiview Video Coding for Free-Viewpoint Video Communication

Hideaki Kimata,1 Masaki Kitahara,2 Kazuto Kamikura,1 Yoshiyuki Yashima,1 Toshiaki Fujii,3 and Masayuki Tanimoto3

1 NTT Cyber Space Laboratories, NTT Corporation, Yokosuka, Japan
2 NTT Advanced Technology Corporation, Yokohama, Japan
3 Department of Electrical Engineering and Computer Science, Graduate School of Engineering, Nagoya University, Nagoya, Japan

SUMMARY

We have proposed free-viewpoint video communications, in which a viewer can change the viewpoint and viewing angle while receiving and watching video content. A free-viewpoint video consists of several views taken from different viewpoints. To change the viewpoint and view angle freely and instantaneously, a random access capability that allows the requested view to be decoded with little delay is necessary. In this paper, a multiview video coding method that achieves high coding efficiency together with low-delay random access functionality is proposed. In the proposed method, the GOP is the basic unit of a view, and selective reference picture memory management is applied across multiple GOPs to improve coding efficiency. In addition, a coding method for the disparity vectors that utilizes the camera arrangement is proposed. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(5): 14-29, 2007; Published online in Wiley InterScience. DOI /scj

Key words: free viewpoint; free-viewpoint video; multi-viewpoint video; disparity compensation; H.264.

1. Introduction

Free-viewpoint video is a visual representation in which a viewer can change the viewpoint freely as desired while watching the video content. Showing such highly interactive, high-quality video content provides immersive visual experiences to viewers. Visual representations in which the viewpoint changes have been used in movies and in sports broadcasting. For instance, in movies, special camerawork is used to allow a viewer to see a scene as if from continuously changing viewpoints while time is stopped. To create such camerawork, camera images are captured from multiple viewpoints, and the images for the camerawork are then generated in a studio for virtual camera positions determined by the movie maker. These generated images are used for transmission and broadcasting. In these traditional applications, the viewer cannot change the viewpoint. In contrast, we have proposed free-viewpoint TV and free-viewpoint video communication [1-3]. In these proposed applications, interactivity is very high because viewers can change viewpoints freely. In the MPEG standardization body, the 3DAV activity on natural three-dimensional video coding is in progress, and within that activity the standardization of free-viewpoint TV and multiview video coding is being considered [8, 16].

In free-viewpoint video communication, it is assumed that the transmitting side captures a scene with multiple cameras and produces multiview video data, and that the receiving side generates and displays an image while changing the viewpoint (Fig. 1). On the transmitting side, all of the cameras are assumed to be calibrated in advance, and the camera parameters obtained from calibration are transmitted together with the video bitstream. On the receiving side, after the video bitstream is decoded, the image from the virtual camera position is generated by view interpolation techniques that make use of the camera parameters. The higher the camera density on the transmitting side, the smoother the change of viewpoints that can be achieved.

Fig. 1. Free-viewpoint video communications.

In free-viewpoint video communications, multiview video coding with high coding efficiency is an essential technology. In addition, multiview video coding requires low-delay random access for viewpoint changes, because the views needed to generate the displayed image change as the viewpoint moves. This paper presents research results focusing on multiview video coding.

For multiview video coding, a method that exploits the epipolar constraint and encodes the multiview images as one video has been proposed [4]. However, this proposal has been limited to still objects and has not been extended to moving objects. The MPEG-2 multiview profile uses a coding method for stereo video in which one view is predicted from the other view, and by extending this method a coding method for multiple views using prediction between views has been proposed [5]. In that method, view scalability was also proposed, allowing the set of views to be decoded to be changed from multiple views to a stereo or single view. However, the method does not provide low-delay random access to a requested view, because all views have the same GOP structure; that is, the number of pictures in a GOP is the same for all views. Moreover, because it is based on MPEG-2, it does not make use of the reference picture selection method adopted in H.264, in which the reference picture is selected from multiple decoded images. For these reasons, high coding efficiency has not been achieved.

In this paper, we propose a new multiview video coding method which achieves low-delay random access to a requested view with respect to changes of viewpoint and view direction, while maintaining high coding efficiency. In this paper, the delay is measured by the number of frames that must be decoded in order to obtain the requested frame of the requested view. In the second section of the paper, we propose a multiview video coding method based on the reference picture selection method, which has low-delay random access functionality. In the third section, we propose a new disparity prediction method, in which camera geometry information is used for coding the disparity vectors and for determining the search range of the disparity vectors.

2. Multiview Video Coding

2.1. Assumed camera arrangement and multiview video

This section describes the assumed camera arrangement and the structure of multiview video. To construct a free-viewpoint video, the cameras must be arranged densely. When the epipolar constraint is utilized for generation of a virtual view, it is better if the cameras are arranged regularly [6]. Figure 2 presents an example of the assumed camera arrangement.
Figure 2(a) shows the structure in which the cameras are arranged in a line, and Fig. 2(b) shows the structure in which the cameras are arranged in an arc. In practice, because the cameras are arranged manually, some error in the camera positions is present, and it is difficult to remove such errors at the pixel level before capture. Such errors could be corrected before encoding the video signals, using the camera parameters obtained by camera calibration, but this requires a huge amount of processing time because the correction must be applied to all pixels of all cameras; it is therefore particularly unsuitable for a system with many cameras such as the communications application discussed in this paper. Not only errors in camera geometry but also color inconsistency between cameras is difficult to remove.

Fig. 2. Assumed camera arrangements.

Therefore, in this paper, we assume that pixel-level errors in the camera positions and colors are removed on the receiving side, using the camera parameters, when a virtual view is generated, and that the images containing such pixel-level errors are the ones subject to encoding.

2.2. Proposal of GoGOP structure and GOP adaptive reference picture selection method

In free-viewpoint video communications, not all views are necessarily decoded, because obtaining the requested view suffices. There are two ways to obtain partial data from a multiview video. The first is partial decoding, in which the receiving side decodes partial data after it receives all the multiview data; the second is view scalability, in which only the data necessary to obtain a requested view are transmitted [9]. Figure 3 shows the flow between the transmitting side and the receiving side for these two methods.

Fig. 3. Data flow in request for a view.

We have proposed the GoGOP (Group of GOP) structure to implement these methods [3, 10]. It extends the concept of the GOP structure in conventional 2D video to multiview video. In the GoGOP structure, a view consists of several GOPs, and prediction coding is applied between GOPs. A GOP is categorized as either a Base GOP or an Inter GOP. In a Base GOP, the pictures can be decoded using only pictures in the same GOP; in an Inter GOP, they can be decoded using pictures in other GOPs as well as the same GOP. An Inter GOP can achieve higher coding efficiency than a Base GOP, because the correlation between GOPs is utilized in the prediction coding. A GOP within a GoGOP is encoded using only GOPs in that GoGOP.

Figure 4 shows examples of the GoGOP structure. A white square represents a picture in a Base GOP and a gray square represents a picture in an Inter GOP. The arrows show the reference relations of the pictures. Fine arrows show the relations between pictures: the picture at the origin of an arrow refers to the picture at its tip. Thick arrows show the relations between GOPs: the GOP at the origin of an arrow refers to the GOP at its tip.

Fig. 4. Examples of GoGOP structure.

In Fig. 4(a), Base GOPs and Inter GOPs alternate within a view. In this example, a picture in an Inter GOP also refers to pictures in the same GOP. Partial pictures of all views can be obtained even if the Inter GOPs are not decoded. When the correlation of pictures in the time dimension or the inter-view dimension is high, the images in the Inter GOPs can be generated from the images in the Base GOPs, and thus all images can be obtained. It is also possible to obtain only a requested view. In the case shown in Fig. 4(a), the number of delayed pictures can, at the maximum, equal the number of all pictures within the GoGOP.

In Fig. 4(b), an Inter GOP refers only to Base GOPs. Partial decoding and view scalability can be achieved in units of GOPs, as in the case shown in Fig. 4(a). An Inter GOP may contain multiple pictures, as Fig. 4(b) shows, or it may contain only one picture. In the latter case, the pictures in an Inter GOP can be decoded with low delay while the pictures in the Base GOPs are being decoded; thus, the number of delayed pictures is just one at the minimum. This structure is also efficient in terms of reducing the processing time for decoding pictures and reducing the memory required for reference pictures.
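As an illustration of the GoGOP bookkeeping described above, the following Python sketch (not part of the original paper; the names Gop and decoding_delay, and the simplified delay rule, are assumptions made here) models Base and Inter GOPs with their reference-GOP lists and counts how many pictures must be decoded before a requested picture of a requested view becomes available, which is the delay measure used in this paper.

from dataclasses import dataclass, field

@dataclass
class Gop:
    gop_id: int
    num_pictures: int          # pictures contained in this GOP
    is_base: bool              # Base GOP: decodable from its own pictures only
    ref_gop_ids: list = field(default_factory=list)  # Inter GOP: additionally refers to these GOPs

def decoding_delay(gops: dict, requested_gop_id: int, requested_picture: int) -> int:
    """Count the pictures that must be decoded to reach the requested picture.

    Every (transitively) referenced GOP is assumed to be decoded up to the
    requested time instant; the requested GOP itself is decoded up to and
    including the requested picture.  This is a simplification for illustration.
    """
    needed, stack, visited = 0, [requested_gop_id], set()
    while stack:
        gid = stack.pop()
        if gid in visited:
            continue
        visited.add(gid)
        gop = gops[gid]
        # Pictures decoded within this GOP (up to the requested time instant).
        needed += min(requested_picture + 1, gop.num_pictures)
        stack.extend(gop.ref_gop_ids)
    return needed

# Example in the spirit of Fig. 4(b): Inter GOPs refer only to one Base GOP.
gops = {
    0: Gop(0, num_pictures=8, is_base=True),
    1: Gop(1, num_pictures=8, is_base=False, ref_gop_ids=[0]),
    2: Gop(2, num_pictures=8, is_base=False, ref_gop_ids=[0]),
}
print(decoding_delay(gops, requested_gop_id=2, requested_picture=3))

In this Fig. 4(b)-style example only the Base GOP and the requested Inter GOP need to be decoded, which is why that structure keeps the random-access delay small.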
In the GoGOP structure, a GOP number is assigned to each GOP. The reference relations between GOPs are indicated by reference GOP numbers, which are encoded in the GOP header. Camera arrangement information is useful for determining the reference GOPs. For instance, when the correlation between adjacent views is assumed to be high, GOPs captured by cameras whose positions are close together are selected as candidate reference GOPs. If a reference GOP itself refers to another GOP, the delay in decoding a picture increases. When an acceptable limit on this delay can be set, the reference GOPs must be determined so that the delay does not exceed it.
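Such a selection of reference GOPs from the camera arrangement under a delay budget could look roughly like the following sketch; select_reference_gops and the one-dimensional camera coordinates are illustrative assumptions, not the paper's actual decision procedure, which is left for further study.

def select_reference_gops(camera_pos, current_view, max_refs=2, max_chain=1, chain_len=None):
    """Pick reference GOPs (views) closest in camera position to the current view,
    rejecting candidates whose own reference chain would exceed the delay budget.

    camera_pos : dict view_id -> 1-D camera coordinate along the line or arc
    chain_len  : dict view_id -> length of that view's existing reference chain
                 (0 for a Base GOP); defaults to all Base GOPs.
    """
    chain_len = chain_len or {v: 0 for v in camera_pos}
    candidates = [v for v in camera_pos
                  if v != current_view and chain_len[v] < max_chain]
    # Prefer views whose cameras are physically closest (highest expected correlation).
    candidates.sort(key=lambda v: abs(camera_pos[v] - camera_pos[current_view]))
    return candidates[:max_refs]

# Cameras arranged on a line as in Fig. 2(a); views 4 and 6 flank view 5.
camera_pos = {3: 0.0, 4: 0.2, 5: 0.4, 6: 0.6, 7: 0.8}
print(select_reference_gops(camera_pos, current_view=5))   # -> [4, 6], i.e. both-sided references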

This paper presents research results on the relationship between coding efficiency and the structure of the reference GOPs.

We propose the GOP adaptive Reference Picture Selection (GRPS) method for the GoGOP structure, extending the hierarchical reference picture selection method that has been proposed for temporal scalable video coding [11]. Multiple reference picture memories, managed logically independently for each GOP, are prepared, and the reference picture memories to be used are selected adaptively. Each reference picture memory assigned to a GOP holds multiple decoded pictures, and a reference picture is selected from those pictures. Figure 5 shows the structure of the GRPS decoder. The reference GOPs used for decoding are indicated by reference GOP numbers. Reference indices are assigned to the indicated reference GOPs, and the reference picture is selected according to the reference index encoded in the bitstream. In the proposed method, the reference picture is selected per block. For instance, in the case shown in Fig. 4(a), when GOP5 and GOP6 are set as the reference GOPs of GOP6, reference indices are assigned to the pictures stored in the reference picture memories for GOP5 and GOP6, and the pictures in GOP6 are then decoded. No reference indices are assigned to the pictures in GOP4.

Fig. 5. Decoder configuration for GOP adaptive Reference Picture Selection (GRPS).

To improve coding efficiency in the reference picture selection method, the coding mode, the displacement vectors (motion vectors and disparity vectors), and the reference indices for block B are chosen so as to minimize the cost function defined by Eq. (1) [12]:

J = Σ_{(i,j)∈B} {o(i, j, g, t) − r(i, j, g, t)}² + λR,    (1)

where o(i, j, g, t) represents the original image at position (i, j) in frame t of GOP g, r(i, j, g, t) represents the decoded image, R represents the number of encoded bits for the block, and λ is the Lagrangian multiplier. The decoded image r(i, j, g, t) is given by

r(i, j, g, t) = a · p(i + d_x, j + d_y, g + h, t + s) + b + e(i, j),    (2)

where p(i, j, g, t) is the predictive image, e(i, j) is the residual error, and (d_x, d_y) is the displacement vector. h represents the difference in GOP numbers and s that in frame numbers. The coefficients a and b are used for color correction when the GOP number is different.
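A minimal sketch of the block-wise decision of Eqs. (1) and (2) is given below, assuming the candidate predictions have already been fetched from the per-GOP reference picture memories. The name rd_cost and the toy bit-count model are illustrative, and only the gain/offset color correction of Eq. (2) is reproduced (the residual term is omitted because it is exactly what the SSD in Eq. (1) measures).

import numpy as np

def rd_cost(org_block, ref_blocks, lam, bits_for):
    """Choose the reference index and displacement minimizing J = SSD + lambda*R (Eq. (1)).

    org_block  : original block o(i, j, g, t) as a 2-D array
    ref_blocks : dict ref_idx -> {"a": gain, "b": offset, "displaced": {(dx, dy): block}}
                 where (a, b) is the color correction of Eq. (2) for a different GOP
    bits_for   : callable estimating the bits R needed to code (ref_idx, displacement)
    """
    best = None
    for ref_idx, cand in ref_blocks.items():
        a, b = cand["a"], cand["b"]                    # color correction (weighted prediction)
        for disp, pred in cand["displaced"].items():   # (dx, dy) -> displaced prediction block
            corrected = a * pred + b                   # Eq. (2) without the residual term
            ssd = float(np.sum((org_block - corrected) ** 2))
            j = ssd + lam * bits_for(ref_idx, disp)
            if best is None or j < best[0]:
                best = (j, ref_idx, disp)
    return best

# Toy usage: one reference from the same GOP, one from a neighbouring GOP with a gain of 0.9.
org = np.full((4, 4), 100.0)
refs = {
    0: {"a": 1.0, "b": 0.0, "displaced": {(0, 0): np.full((4, 4), 98.0)}},
    1: {"a": 0.9, "b": 5.0, "displaced": {(1, 0): np.full((4, 4), 104.0)}},
}
print(rd_cost(org, refs, lam=10.0, bits_for=lambda r, d: 4 + 2 * r))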

2.3. Prediction error in multiview video coding

The prediction error in multiview video coding is estimated by applying the ray space representation. A disparity compensation scheme that utilizes the features of the ray space for coding multiview images has been proposed [17]. Reference 17 discusses the case in which the standard plane is moved along the viewing angle in the ray space; this paper instead discusses the case in which the standard plane is moved along the viewing position, because the cameras are arranged in a line rather than in a circle. When we set the standard plane P shown in Fig. 6(a) for the camera arrangement in Fig. 2(a), the rays crossing the standard plane are represented in a ray space whose dimensions are the ray directions (θ, φ) and positions (x, y) [6]. The camera images correspond to multiple rays (real rays) in the ray space whose direction dimension is u and whose horizontal position is x, as shown in Fig. 6(b), where u is given by u = tan(θ) for horizontal angle θ if the vertical information is omitted. If the camera arrangement is fixed over time, the rays are arranged along the time dimension as shown in Fig. 6(c). The correlation of the real rays in the time dimension depends on the distance Δt between the real rays, and the correlation in the positional dimension depends on the distance Δx between the real rays.

Fig. 6. Samples transformed into ray space with time axis from real captured images. (a) Real camera sets; (b) samples in ray space; (c) samples in ray space with time axis; (d) relations of view angle and camera distance.

Here the relationship of the rays in the position and time dimensions is analyzed. The image o(i, j, g, t) captured by the cameras in Fig. 6(d) is transformed to the ray f(x, y, v, t) on the standard plane as follows:

(3)

First, the errors in the positional dimension of the real rays are discussed. The error E_v on the standard plane for the current frame captured by camera v2, when the current frame refers to the frame captured by camera v1 at the same time, is given by Eq. (4):

(4)

Here, region S_c is the covered area in which v2 and v1 overlap on the standard plane, and S_u is the uncovered area in which they do not overlap. The displacement vector (α, β) indicates the position where the correlation is highest in the frame of v2. The difference between f(x, y, v, t) and f(x, y, v−1, t) can be regarded as the difference for an angle change Δθ at position (x, y) on the standard plane, and the angle change Δθ can be approximated by the position change Δx and the distance Z from the camera to the standard plane. If we introduce the complexity M of the frame in the positional dimension of the real rays, the difference within the frame can be defined as a quantity that depends on M. Then Ē_v, the average of the error E_v, is expressed as follows, where ρ_v(Δθ) is the average difference of the real rays in region S_c for the angle change, and ρ_a(M) is the average difference of the real rays in region S_u for the position change within the frame:

(5)

Next, the errors in the time dimension of the real rays are discussed. The error E_t on the standard plane for the current frame captured by camera v2, when the current frame refers to the previous frame of the same camera, is given by Eq. (6):

(6)

The average error Ē_t is given below, where ρ_t(Δt) is the average difference of the real rays in region S for the temporal change:

(7)

Based on the above analysis, the average error Ē among the real rays is obtained by averaging, with appropriate weights, the error in the positional dimension Ē_v and the error in the time dimension Ē_t:

Ē = w_v Ē_v + w_t Ē_t.    (8)

The coding efficiency can be improved if this average Ē is reduced. Thus, to improve the coding efficiency by utilizing the correlation between views, the average error Ē_v should be reduced. Provided that ρ_v(Δx/Z) is much smaller than ρ_a(M), the average error Ē_v can be reduced by decreasing the distance Δx between the real rays in the ray space. However, it is difficult in practice to make the distance between the real rays very small, because a camera has a physical size. We therefore propose to use the reference picture selection method to improve coding efficiency. For prediction of the real rays in the positional dimension, not only the adjacent frame with the same time stamp but also other frames are set as candidate reference pictures. By this scheme, the average error Ē_v of the real rays is reduced. In addition, the reference picture selection method is also applied to the time dimension.
Here the weighting coefficients w_v and w_t in Eq. (8) correspond to the selection ratios of disparity compensation and motion compensation, respectively. We showed earlier that coding efficiency can be improved by increasing the number of reference pictures when the frame rate is low for temporal prediction [11]. Moreover, to improve coding efficiency, the cost function J given by Eq. (1) must be minimized for all pixels to be encoded. The average error Ē corresponds to the SSD term in Eq. (1), and it is necessary to reduce it, but it is also necessary to reduce the number of bits R, e.g., the bits for representing the disparity vectors.

2.4. Experimental results and discussion (without reference picture memory for Inter GOPs)

The coding efficiency of multiview video with the GoGOP structure depends on the reference relations between GOPs. Experiments were therefore conducted to evaluate the coding efficiency of the GRPS method while changing the camera distances, the structure of the reference picture memories, and the size of the reference picture memories. The evaluation was carried out in terms of the number of bits and the PSNR of the current GOP.

First, experimental results are presented for the case in which temporal prediction is not applied to the Inter GOPs. The GoGOP structure corresponds to Fig. 4(b). In this case, an Inter GOP has multiple frames, but it has no reference picture memory for storing decoded images. Figure 7 presents an example of the time stamps of the reference pictures; in particular, Fig. 7(b) illustrates the case in which a decoding delay of one frame from a Base GOP is allowed.

Fig. 7. Positions of reference pictures for low-delay decoding.

Table 1 summarizes the test sequence conditions, and Table 2 summarizes the encoding conditions. We examined two sequences with different camera arrangements, both provided by the MPEG 3DAV group. The sequence Flamenco is provided as a KDDI test sequence for multiview video [7], and its camera arrangement corresponds to Fig. 2(a). The sequence Aquarium was provided by Nagoya University [8], and its camera arrangement corresponds to Fig. 2(b). Note that in this experiment no color correction was applied to the sequences in advance. Figure 8 shows examples of the multiview video used. In the figure, the left column corresponds to the camera at the left edge, the middle column to the camera in the middle, and the right column to the camera at the right edge; the top row is the first frame and the bottom row is the final frame.

Table 1. Test sequences

Table 2. Encoding conditions

Fig. 8. Examples of images used in experiments.

All GOPs consisted of the same number of frames, all reference GOPs were encoded as Base GOPs, and the quantization parameter (QP) was set the same as for the current GOP. The coefficients a and b in Eq. (2) were calculated by Eq. (9) for color correction; the coefficient a was computed as a ratio over all pixels in frame F:

(9)

The proposed coding method was implemented on the basis of H.264, and the color correction by the coefficients a and b in Eq. (2) was carried out by weighted prediction (WP) as specified in H.264.

Figure 9 shows the PSNR when view 5 was encoded with one reference picture, using view 4 or view 3 as the reference GOP, for the sequence Flamenco, and the PSNR when view 8 was encoded with one reference picture, using view 7 or view 6 as the reference GOP, for the sequence Aquarium. In the figure, base denotes the case in which all frames were encoded as Intra frames, and GOPx denotes the case in which the view was encoded with view x as the reference GOP. We see from these results that the coding efficiency is improved by encoding as an Inter GOP for both sequences, and that the improvement is larger when the camera distance is shorter. This is because the prediction efficiency improves when the distance Δx between the real rays in Eq. (5) is small.

Fig. 9. PSNR for different reference GOPs in the absence of reference pictures for the current GOP.
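Regarding the color correction used in these experiments: since the exact form of Eq. (9) is not reproduced in this transcription, the sketch below assumes, purely for illustration, that the gain a is the ratio of the summed luminance of the current frame to that of the reference frame and that the offset b is zero; a global gain/offset pair of this kind is what H.264 explicit weighted prediction signals per reference picture. The function name estimate_wp_gain_offset is an assumption of this sketch.

import numpy as np

def estimate_wp_gain_offset(cur_frame, ref_frame):
    """Estimate a global gain/offset pair (a, b) for H.264-style weighted prediction.

    The gain is taken here as a ratio over all pixels of the frame, as described in
    the text; the zero offset is an assumption of this sketch, not Eq. (9) itself.
    """
    a = float(np.sum(cur_frame)) / max(float(np.sum(ref_frame)), 1.0)
    b = 0.0
    return a, b

cur = np.full((8, 8), 110.0)   # current view, slightly brighter
ref = np.full((8, 8), 100.0)   # reference view
a, b = estimate_wp_gain_offset(cur, ref)
print(round(a, 3), b)          # 1.1 0.0 -> the prediction becomes a*ref + b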

Figure 10 shows the PSNR when view 5 was encoded using view 4 as the reference GOP with two or more reference pictures for the sequence Flamenco. We see from this figure that the coding efficiency is improved when the number of reference pictures is three and a one-frame delay is allowed, compared with the case in which the number of reference pictures is one. When the number of reference pictures was two, or when no delay was allowed with three reference pictures, a coding gain of only a few percent was obtained. For the sequence Aquarium, no coding gain was obtained by increasing the number of reference pictures. We see from these results that whether the coding efficiency improves when the number of reference pictures is increased in two different directions in the time dimension (e.g., toward the past and toward the future) depends on the characteristics of the sequences. This implies that the prediction efficiency is improved when many adjacent real rays are used for prediction, as shown in Fig. 6(c).

Fig. 10. PSNR for different numbers of reference pictures in a GOP in the absence of reference pictures for the current GOP.

We also examined the coding efficiency when the number of reference GOPs was increased. Figure 11 shows the PSNR when the number of reference GOPs was two for both sequences. Results are shown both for the case in which views on both sides are set as the reference GOPs and for the case in which views on one side only are set as the reference GOPs. In the former case, views 6 and 4 were set as the reference GOPs for the sequence Flamenco, and views 9 and 7 for the sequence Aquarium. In the latter case, views 4 and 3 were set as the reference GOPs for the sequence Flamenco, and views 7 and 6 for the sequence Aquarium.

In the figure, v2_bidir denotes the results in the former case and v2_unidir those in the latter case. We see from these results that the coding efficiency is improved when the number of reference GOPs is increased, and that it is better when views on both sides are set as the reference GOPs. This is because the prediction efficiency is improved by an increase in the number of candidate reference pictures; in particular, when views on both sides are set as the reference GOPs, the prediction error is reduced by a decrease of the region S_u in which the real rays do not overlap, as shown in Fig. 6(d).

Fig. 11. PSNR for different numbers of reference GOPs in the absence of reference pictures for the current GOP.

Figure 12 shows an example of decoded images of the sequence Aquarium for QP equal to 36; in this case view 8 was encoded with GOP6 and GOP7 as the reference GOPs. As shown in the figure, there are no noticeable blocking artifacts or afterimages; the absence of blocking artifacts is essentially the effect of the deblocking filter. However, blurring is evident, especially around the algae area, when GOP6 is set as the reference GOP. This is because the prediction efficiency decreases when the disparity is large.

Fig. 12. Examples of decoded images for different reference GOPs.

2.5. Experimental results and discussion (with reference picture memory for Inter GOPs)

Next, experimental results are presented for the case in which reference picture memory is provided for the Inter GOPs, so that temporal prediction can be selected even in Inter GOPs. Even in this case, when the GOP length in the time direction is small, relatively low-delay random access to a requested view is possible. The encoding conditions were the same as in Table 2. All the decoded pictures in the reference picture memory of an Inter GOP were discarded before the first picture of the next GOP was encoded.

Figure 13 shows the PSNR when view 5 was encoded as a Base GOP for the sequence Flamenco and when view 8 was encoded as a Base GOP for the sequence Aquarium. Figure 14 shows the PSNR when view 5 was encoded with view 4 or view 3 as the reference GOP for the sequence Flamenco, and when view 8 was encoded with view 7 or view 6 as the reference GOP for the sequence Aquarium. The number of reference pictures was two.

We see from Fig. 13 that for the sequence Aquarium the coding efficiency is improved as the number of reference pictures is increased when the view is encoded as a Base GOP, whereas it is not improved noticeably for the sequence Flamenco. The improvement for the sequence Aquarium arises because the prediction efficiency is improved by an increase in the number of candidate reference pictures in the time dimension; it is surmised that the tendency is noticeable for Aquarium because the frame rate is low and the distance Δt between the real rays in the time dimension is large. We see from Fig. 14 that for both sequences the coding efficiency is improved when views are encoded as Inter GOPs, and that the improvement increases at shorter camera distances. This is because the prediction efficiency improves as the distance Δx between the real rays in Eq. (5) becomes smaller, as in the case discussed in Section 2.4. This tendency is noticeable for the sequence Flamenco.

The above results show that, for the sequence Flamenco, the coding efficiency is improved more when the number of reference pictures is increased for disparity compensation than when it is increased for motion compensation.

Fig. 13. PSNR for coding as a Base GOP.

Fig. 14. PSNR for different reference GOPs when the current GOP has two reference pictures.

The coding efficiency was also evaluated as the number of reference GOPs was increased. Figure 15 shows the PSNR when the number of reference GOPs was two for both sequences. Views on both sides were set as the reference GOPs: views 6 and 4 to encode view 5 for the sequence Flamenco, and views 9 and 7 to encode view 8 for the sequence Aquarium. The number of reference pictures per GOP was two. We see from the results that the coding efficiency is improved as the number of reference GOPs is increased. This is because the number of candidate reference pictures is increased and the prediction efficiency is improved by decreasing the region S_u where the real rays do not overlap, as shown in Fig. 6(d), similarly to the case discussed in Section 2.4.

Fig. 15. PSNR for different numbers of reference GOPs when the current GOP has two reference pictures.

Figure 16 shows the reduction ratio of the number of bits when view 5 was encoded as Inter GOPs for the sequence Flamenco.

Fig. 16. Reduction ratio of bit numbers.

When view 4 was set as the reference GOP, the ratio was high, sometimes exceeding 50%, in the first frame, where temporal prediction was not applied, and remained noticeable, sometimes exceeding 10%, in the subsequent frames. We see that disparity compensation contributes to an increase of coding efficiency in frames other than the first. However, the ratio for those frames is quite small, sometimes less than one-fifth of the ratio for the first frame. It is considered that, for the sequence Flamenco, there are large regions of real rays in which the error E_t, which depends on the distance Δt between the real rays, is smaller than the error E_v, which depends on the distance Δx between the real rays.

3. Adaptive Disparity Compensation

3.1. Usage of camera arrangement

In the earlier sections, we proposed a coding method that improves coding efficiency by decreasing the average error Ē in multiview video coding. In this section, we propose a coding method that reduces the number of bits spent on the disparity vectors. Especially for a structure intended to achieve low-delay random access to views, highly efficient disparity compensation is necessary. In this paper a fixed camera arrangement is assumed, and that condition is utilized to improve coding efficiency.

3.2. Reference disparity vector prediction method

If the camera arrangement is fixed, it can be assumed that the change of disparity over time is small. We therefore propose a reference disparity vector prediction method in which the disparity vectors of the current frame are coded using previously coded disparity vectors. The objective of this proposal is to reduce the number of bits for the disparity vectors. The disparity vectors of the first frame of a GOP are stored and used for coding the disparity vectors of the subsequent frames. As shown in Fig. 17, the bitstream of the first frame is divided into disparity information and texture information. The disparity information contains the disparity vectors and mode information such as the block partitioning pattern and the intra/inter mode. After the first frame is decoded, its disparity vectors are stored in memory as reference vectors (rdv_x, rdv_y). In the subsequent frames, these reference vectors are loaded from memory and used for decoding the disparity vectors. The disparity vector (dv_x, dv_y) is derived from the reference vector (rdv_x, rdv_y) and the differential vector (ddv_x, ddv_y) by

(dv_x, dv_y) = (rdv_x + ddv_x, rdv_y + ddv_y).    (10)

Fig. 17. Bitstream structure of reference disparity vector coding.

When reference disparity vector prediction is not used, the disparity vector is derived in the same way as the motion vectors: the predictive vector is set to the intermediate value (median) of the disparity vectors of the surrounding blocks, and the disparity vector is obtained by adding the predictive vector and the differential vector.
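A sketch of the decoder-side use of Eq. (10) is shown below; decode_disparity_vectors is an illustrative name, and the sketch assumes one reference vector per block and ignores any change of block partitioning between frames.

def decode_disparity_vectors(first_frame_dvs, differential_per_frame):
    """Reconstruct per-block disparity vectors with reference disparity vector prediction (Eq. (10)).

    first_frame_dvs       : list of (rdv_x, rdv_y) decoded from the first frame of the GOP,
                            kept in memory as reference vectors
    differential_per_frame: for each subsequent frame, a list of (ddv_x, ddv_y) parsed
                            from the bitstream, one per block
    Returns the reconstructed (dv_x, dv_y) per block for every subsequent frame.
    """
    decoded = []
    for diffs in differential_per_frame:
        frame_dvs = [(rdv_x + ddv_x, rdv_y + ddv_y)           # dv = rdv + ddv  (Eq. (10))
                     for (rdv_x, rdv_y), (ddv_x, ddv_y) in zip(first_frame_dvs, diffs)]
        decoded.append(frame_dvs)
    return decoded

# Two blocks whose disparity barely changes over time, so the differentials stay small.
reference = [(12, 0), (8, 1)]
diffs = [[(0, 0), (1, 0)],     # frame 2
         [(-1, 0), (0, 0)]]    # frame 3
print(decode_disparity_vectors(reference, diffs))

Because the disparity changes little over time for a fixed camera arrangement, the differentials stay small, which is where the bit saving comes from.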

Figure 18 shows the results of the reference disparity vector prediction method. It shows the PSNR when view 8 was encoded using view 7, 3, or 1 as the reference GOP for the sequence Flamenco, and when view 15 was encoded using view 14, 7, or 1 as the reference GOP for the sequence Aquarium. In each experiment, only disparity compensation is applied. In the figure, rdv denotes the results of the reference disparity vector prediction method. The encoding conditions are the same as in Table 2. We see from the results that the coding efficiency is improved by the reference disparity vector prediction method regardless of the camera distance.

Fig. 18. PSNR for different reference GOPs when applying reference disparity vector coding.

Table 3 shows the ratio of the number of bits for the disparity vectors in the first frame of the GOPs for the sequence Flamenco. The ratio is larger for smaller camera distances. This is because the prediction efficiency improves as the distance Δx between the real rays in Eq. (5) becomes smaller.

Table 3. Bit number of disparity vector of first picture

3.3. Adaptive disparity vector estimation method

The direction of the disparity is often the same as the direction along which the cameras are arranged. In this section, we propose an adaptive disparity vector estimation method that decreases the number of bits for the disparity vectors by utilizing that feature. When a disparity vector is sought beyond the distance Δx between the real rays given by Eq. (5), a region S_c in which the real rays overlap, as shown in Fig. 6(d), exists, and the prediction efficiency is improved. Therefore, the search range should be large; however, a uniform increase in the search range increases the complexity. Thus, in the proposed method, the search range is determined from the above property of the disparity vectors, and the increase in complexity is avoided by limiting the search accuracy.

In the base disparity vector search (the BDS method), as in the motion search of the H.264 JM reference software [13], the following procedure is applied to the luminance information.

(a1) The predictive disparity vector is derived.
(a2) The disparity vector is sought with integer-pel precision within the predetermined search range, with the derived predictive vector set as the origin of the search.
(a3) The disparity vector is sought at half-pel precision at the surrounding eight positions.
(a4) The disparity vector is sought at quarter-pel precision at the surrounding eight positions.

In the proposed adaptive disparity vector estimation method, either the base search method (BDS) or an extended method in which the search range is doubled in the camera arrangement direction (the DDS method) is selectively applied. A judgment criterion based on the difference of luminance is used to determine which search method is applied. Provided that the cameras are arranged in the horizontal direction, the search method is determined in accordance with the flow illustrated in Fig. 19, using the judgment value DL obtained by Eq. (11). In Eq. (11), (L0 + L1)/2 is the average difference of luminance in the same search range as used in the base method, and (L0 + L1 + L2)/3 is the average difference of luminance in a search range that is double the base range:

(11)

L0, L1, and L2 are calculated by Eq. (12). L0 is the average difference of luminance at the position where the current frame is not shifted relative to the reference frame (G = 0), L1 is that at the position where the current frame is shifted horizontally by R, i.e., the base search range (G = R), and L2 is that at the position where the current frame is shifted horizontally by 2R, i.e., double the base search range (G = 2R). Region A is the region used for calculating the difference of luminance, and Na is the number of pixels in it:

(12)

The following procedure is applied in the DDS method.

(b1) The predictive disparity vector is derived.
(b2) The disparity vector is sought with integer-pel precision in the vertical direction and two-pel precision in the horizontal direction within the predetermined search range, with the derived predictive vector set as the origin of the search.
(b3) The disparity vector is sought with half-pel precision in the vertical direction and integer-pel precision in the horizontal direction at the surrounding eight positions.
(b4) The disparity vector is sought with quarter-pel precision in the vertical direction and half-pel precision in the horizontal direction at the surrounding eight positions.

In the adaptive disparity vector estimation method, because the number of search positions is the same as in the base search method (BDS), the complexity with respect to the number of search positions is not increased. For Eqs. (11) and (12), which are used to determine the search method, the increase in complexity is negligible because at most three SAD calculations are involved. Note that with DDS the disparity vector is obtained at half-pel accuracy and is coded as a half-pel accuracy vector. On the other hand, if the search range is simply extended, the complexity increases greatly. The number of SAD calculations is a measure of the complexity of the search, and the processing time for the search constitutes about 80% of the encoding time for a frame. Therefore, if the search range is simply extended to double size, the number of SAD calculations increases by a factor of 4, and the processing time to encode a frame is increased by a factor of 3.2. If the search range is extended only after the direction of extension has been determined, the number of calculations is still doubled, and the processing time to encode a frame is increased by a factor of 1.6.

Fig. 19. Flow of determination of disparity search methods in the adaptive disparity vector search method.
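The BDS/DDS decision can be sketched as follows. Because Eqs. (11) and (12) are not reproduced in this transcription, the judgment value DL is assumed here, for illustration only, to be the difference between the base-range average (L0 + L1)/2 and the doubled-range average (L0 + L1 + L2)/3, with the three terms measured by SAD at horizontal shifts of 0, R, and 2R; choose_search_method and the threshold are illustrative names and values of this sketch.

import numpy as np

def sad(a, b):
    return float(np.sum(np.abs(a - b)))

def choose_search_method(cur_block, ref_frame, x, y, R, thresh=0.0):
    """Decide between the base search (BDS) and the doubled horizontal range search (DDS).

    L0, L1, L2 are per-pixel luminance differences at horizontal shifts 0, R, 2R
    (cf. Eq. (12)); the extended range is selected when the doubled-range average
    suggests a better match than the base-range average (assumed form of Eq. (11)).
    """
    h, w = cur_block.shape
    L = []
    for g in (0, R, 2 * R):                       # horizontal shifts G = 0, R, 2R
        ref = ref_frame[y:y + h, x + g:x + g + w]
        L.append(sad(cur_block, ref) / (h * w))
    L0, L1, L2 = L
    dl = (L0 + L1) / 2.0 - (L0 + L1 + L2) / 3.0   # judgment value DL (assumed form)
    # DDS then searches with two-pel horizontal steps over the doubled range (steps b1-b4).
    return "DDS" if dl > thresh else "BDS"

# Toy frames: the best match sits at a horizontal shift of 2R, so the doubled range is chosen.
ref_frame = np.zeros((16, 64))
ref_frame[:, 24:32] = 100.0
cur_block = np.full((8, 8), 100.0)
print(choose_search_method(cur_block, ref_frame, x=8, y=4, R=8))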

Figure 20 shows the results of a comparison of the DDS and BDS methods. It shows the PSNR when view 8 was encoded with view 7, 3, or 1 as the reference GOP for the sequence Flamenco, and when view 15 was encoded with view 14, 5, or 1 as the reference GOP for the sequence Aquarium. In each experiment, the reference disparity vector prediction method was used. Figure 21 shows the values derived from Eq. (11). The encoding conditions were the same as in Table 2.

Fig. 20. PSNR for different reference GOPs, comparing the BDS and DDS methods in disparity search.

Fig. 21. Value of DL.

We see from Fig. 20 that the coding efficiency is improved by extension of the search range when the distance between cameras is large for the sequence Flamenco, but that it is not improved for the sequence Aquarium. This is because the Aquarium cameras are arranged in an arc, so the extension of the region in which the real rays overlap on the standard plane in the ray space is small. Figure 21 confirms the validity of the judgment criterion of Eq. (11) for determining whether the search range should be extended. Thus, the proposed adaptive disparity vector estimation method is effective.

As additional experimental results, the DDS method and a method using horizontal quarter-pel disparity prediction were compared. In the latter method, a horizontal quarter-pel search was carried out after step (b4) of the DDS flow. In Fig. 22, qpel denotes the results of the latter method. We see from the results that the coding efficiency is higher at half-pel accuracy regardless of the distance between cameras. This is because the number of bits for the disparity vectors is smaller at half-pel accuracy. The coding efficiency of the quarter-pel disparity compensation shown in Fig. 22 is lower than that of the BDS method without search range extension shown in Fig. 20; this degradation of coding efficiency derives from the two-pel search in step (b2) of the DDS method.

Fig. 22. PSNR for different reference GOPs when applying quarter-pel disparity compensation.

3.4. Filter coefficients for disparity compensation

As shown in the previous section, half-pel disparity compensation combined with the adaptive disparity vector estimation method provides better coding efficiency when the distance between cameras is large. In this section, therefore, the filter coefficients used to generate the images at half-pel positions are discussed. In H.264, which serves as the basis of this paper, the images at the half-pel positions are generated by a six-tap Wiener filter. The Wiener filter improves coding efficiency for high-definition images [14, 15]. On the other hand, a two-tap filter, which has a low-pass filtering effect, is used for the quarter-pel positions.

Figure 23 shows the average ratio of the number of bits when a six-tap filter and a two-tap filter are used for the horizontal half-pel positions; the PSNR was almost the same in both cases. The average ratio ave_ratio of the number of bits was calculated by Eq. (13), where Num2 and Num6 are the numbers of bits for the two-tap filter and the six-tap filter, respectively, and the results for the four QP values shown in Table 2 were averaged:

(13)

Fig. 23. Average reduction ratio ave_ratio of bit number.

We see from the results that the coding efficiency tends to be higher with the two-tap filter when the distance between cameras is large. Thus, when the prediction efficiency is degraded by an increase of the distance Δx between the real rays in the ray space, the stronger low-pass filtering effect is advantageous.
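For reference, the two interpolation filters compared above can be sketched as follows; the six-tap coefficients (1, -5, 20, 20, -5, 1)/32 are the standard H.264 half-pel luma filter, the two-tap case is a simple bilinear average, and the function name half_pel_row is an illustration rather than code from the paper.

import numpy as np

H264_SIX_TAP = np.array([1, -5, 20, 20, -5, 1], dtype=float)

def half_pel_row(row, taps):
    """Interpolate horizontal half-pel samples of a 1-D luminance row with the given filter."""
    kernel = taps / taps.sum()                     # normalise (e.g. /32 for the six-tap filter)
    half = np.convolve(row, kernel[::-1], mode="valid")
    return np.clip(half, 0, 255)

row = np.array([90, 95, 100, 120, 150, 160, 158, 155], dtype=float)

# Six-tap Wiener-like filter used by H.264 for the half-pel positions.
print(half_pel_row(row, H264_SIX_TAP))
# Two-tap (bilinear) alternative discussed in the text, which has a stronger low-pass effect.
print(half_pel_row(row, np.array([1.0, 1.0])))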

4. Conclusions

We have proposed the GoGOP structure, the GOP adaptive reference picture selection (GRPS) method, the reference disparity vector prediction method, and the adaptive disparity vector estimation method to improve the coding efficiency of multiview video coding for free-viewpoint video communications. The GoGOP structure and GRPS achieve high coding efficiency while providing low-delay random access to a view. It was shown that the coding efficiency is improved by increasing the number of reference pictures and the number of reference GOPs. It was also shown that the reference disparity vector prediction method and the adaptive disparity vector estimation method improve disparity compensation when the distance between cameras is large.

In free-viewpoint video communications, view scalability is necessary for changing viewpoints through communication. A flexible coding rate control method and a mechanism to guarantee that the received view matches the requested view are also needed, even when a round-trip delay exists between the transmitting and receiving sides [3]. Such a rate control method and communications protocol are items for further study. A method for deciding the Base GOPs and the reference GOPs is also a subject of further study, with the objective of achieving high coding efficiency in multiview video while the delay is kept within the tolerable level.

REFERENCES

1. Tanimoto M, Fujii T. FTV: Free viewpoint television. M8595, MPEG Klagenfurt Document.
2. Tanimoto M. Free viewpoint television using multi-viewpoint image processing. J Inst Image Inf Telev Eng Japan 2004;58. (in Japanese)
3. Kimata H, Kitahara M, Kamikura K, Yashima Y, Fujii T, Tanimoto M. System design of free viewpoint video communication. CIT.
4. Hata K, Etoh M, Chihara K. Coding of multi-viewpoint images. Trans IEICE 1999;J82-D-II. (in Japanese)
5. Lim JE, Ngan KN, Yang W, Sohn K. A multiview sequence CODEC with view scalability. Signal Process Image Commun 2004;19.
6. Fujii T, Kimoto T, Tanimoto M. Ray space coding for 3D visual communication. PCS '96, Vol. 2.
7. Kawada R. KDDI multiview video sequences for MPEG 3DAV use. M10533, MPEG Munich Document.
8. Report on 3DAV exploration. N5878, MPEG Trondheim Document.
9. Kimata H, Kitahara M. Framework on free-viewpoint video with shared memory video coding. M11232, MPEG Palma Document.
10. Kimata H, Kitahara M, Kamikura K, Yashima Y. Multi-view video coding using reference picture selection for free-viewpoint video communication. PCS.
11. Kimata H, Kitahara M, Kamikura K, Yashima Y. Temporal scalable video coding with hierarchical reference picture selection method. Electron Commun Japan (Part III) 2006;89.
12. Sullivan GJ, Wiegand T. Rate-distortion optimization for video compression. IEEE Signal Process Mag 1998;15.

13. Lim K-P, Sullivan GJ, Wiegand T. Text description of joint model reference encoding methods and decoding concealment methods. JVT-K049, JVT Munich Document.
14. Girod B. Motion-compensating prediction with fractional-pel accuracy. IEEE Trans Commun 1993;41.
15. Wedi T. Adaptive interpolation filter for motion compensated hybrid video coding. PCS2001, p. 49-52.
16. Kimata H. Movement on MPEG 3DAV toward international standardization of 3D video. Tech Rep Inf Process Soc Japan 2005, No. 23, 2005-AVM-48. (in Japanese)
17. Fujii T, Kimoto T, Tanimoto M. Data compression of 3-D spatial information based on ray-space coding. J Inst Image Inf Telev Eng Japan 1998;52. (in Japanese)
18. Kimata H, Yashima Y, Kobayashi N. Time adaptive motion estimation method for software-based realtime video coding. IEEE International Conference on Multimedia and Expo (ICME), Vol. 1.

AUTHORS (from left to right)

Hideaki Kimata (member) received his B.E. (1993), M.E. (1995), and Ph.D. degrees from Nagoya University. He joined Nippon Telegraph and Telephone Corporation (NTT) in 1995 and has been engaged in research on picture coding, error tolerance, and image communication systems. His research interests include 3D video signal processing. He is currently a Senior Research Engineer at NTT Cyber Space Laboratories.

Masaki Kitahara (member) received his B.E. and M.E. degrees in industrial and management systems engineering from Waseda University in 1999 and 2001 and then joined NTT. He has been engaged in R&D of data compression for image-based rendering and of H.264 encoding algorithms. His research interests include signal processing methods for 3D applications and video compression.

Kazuto Kamikura (member) received his B.E. and M.E. degrees in electrical engineering from Tokyo Science University in 1984 and 1986 and then joined Nippon Telegraph and Telephone Corporation (NTT). He has been engaged in research and development of video coding systems. His current research interests include digital image processing and video coding. He is currently a Senior Research Engineer, Supervisor, of the Visual Media Communications Project at NTT Cyber Space Laboratories.

Yoshiyuki Yashima (member) received his B.E. (1981), M.E. (1983), and Ph.D. degrees from Nagoya University. In 1983 he joined the Electrical Communications Laboratories, Nippon Telegraph and Telephone Corporation (NTT), where he has been engaged in the research and development of high-quality HDTV signal compression, MPEG video coding algorithms, and lossless image coding systems. His research interests also include pre- and postprocessing for video coding, processing of compressed video, compressed-video quality metrics, and image analysis for video communication systems. He is currently a Senior Research Engineer, Supervisor, of the Visual Media Communications Project at NTT Cyber Space Laboratories. He has also been a visiting professor at the Tokyo Institute of Technology. He was awarded the Takayanagi Memorial Technology Prize. He is a member of the IEEE Signal Processing Society, the Information Processing Society of Japan, IEICE, and the Institute of Image Information and Television Engineers of Japan (ITE).

AUTHORS (continued) (from left to right)

Toshiaki Fujii (member) received his B.E. (1990), M.E. (1992), and D.Eng. degrees in electrical engineering from the University of Tokyo. He is currently an associate professor in the Graduate School of Engineering of Nagoya University. His research interests include 3D image processing and 3D visual communications.

Masayuki Tanimoto (member; Fellow) received his B.E. (1970), M.E. (1972), and D.Eng. degrees in electronic engineering from the University of Tokyo. He joined Nagoya University, where he has been a professor in the Department of Electrical Engineering and Computer Science, Graduate School of Engineering. He received the Ichimura Award, the TELECOM System Technology Award, the ITE Niwa-Takayanagi Best Paper Award, and the IEICE Achievement Award. He was chairperson of the Technical Group on Communication Systems of IEICE, a councilor of IEICE and ITE, and Vice President of ITE. He is a Fellow of IEICE and ITE. His current research interests include image communication, image coding, image processing, 3D images, and ITS.


More information

Fast Hybrid Simulation for Accurate Decoded Video Quality Assessment on MPSoC Platforms with Resource Constraints

Fast Hybrid Simulation for Accurate Decoded Video Quality Assessment on MPSoC Platforms with Resource Constraints Fast Hybrid Simulation for Accurate Decoded Video Quality Assessment on MPSoC Platforms with Resource Constraints Deepak Gangadharan and Roger Zimmermann Department of Computer Science, National University

More information

How To Improve Performance Of H.264/Avc With High Efficiency Video Coding (Hevc)

How To Improve Performance Of H.264/Avc With High Efficiency Video Coding (Hevc) Evaluation of performance and complexity comparison for coding standards HEVC vs. H.264/AVC Zoran M. Milicevic and Zoran S. Bojkovic Abstract In order to compare the performance and complexity without

More information

Overview of the Scalable Video Coding Extension of the H.264/AVC Standard

Overview of the Scalable Video Coding Extension of the H.264/AVC Standard To appear in IEEE Transactions on Circuits and Systems for Video Technology, September 2007. 1 Overview of the Scalable Video Coding Extension of the H.264/AVC Standard Heiko Schwarz, Detlev Marpe, Member,

More information

Point Cloud Streaming for 3D Avatar Communication

Point Cloud Streaming for 3D Avatar Communication 16 Point Cloud Streaming for 3D Avatar Communication Masaharu Kajitani, Shinichiro Takahashi and Masahiro Okuda Faculty of Environmental Engineering, The University of Kitakyushu Japan Open Access Database

More information

Efficient Motion Estimation by Fast Three Step Search Algorithms

Efficient Motion Estimation by Fast Three Step Search Algorithms Efficient Motion Estimation by Fast Three Step Search Algorithms Namrata Verma 1, Tejeshwari Sahu 2, Pallavi Sahu 3 Assistant professor, Dept. of Electronics & Telecommunication Engineering, BIT Raipur,

More information

H.264/MPEG-4 AVC Video Compression Tutorial

H.264/MPEG-4 AVC Video Compression Tutorial Introduction The upcoming H.264/MPEG-4 AVC video compression standard promises a significant improvement over all previous video compression standards. In terms of coding efficiency, the new standard is

More information

A Survey of Video Processing with Field Programmable Gate Arrays (FGPA)

A Survey of Video Processing with Field Programmable Gate Arrays (FGPA) A Survey of Video Processing with Field Programmable Gate Arrays (FGPA) Heather Garnell Abstract This paper is a high-level, survey of recent developments in the area of video processing using reconfigurable

More information

Case Study: Real-Time Video Quality Monitoring Explored

Case Study: Real-Time Video Quality Monitoring Explored 1566 La Pradera Dr Campbell, CA 95008 www.videoclarity.com 408-379-6952 Case Study: Real-Time Video Quality Monitoring Explored Bill Reckwerdt, CTO Video Clarity, Inc. Version 1.0 A Video Clarity Case

More information

Rate-Constrained Coder Control and Comparison of Video Coding Standards

Rate-Constrained Coder Control and Comparison of Video Coding Standards 688 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 7, JULY 2003 Rate-Constrained Coder Control and Comparison of Video Coding Standards Thomas Wiegand, Heiko Schwarz, Anthony

More information

Using AVC/H.264 and H.265 expertise to boost MPEG-2 efficiency and make the 6-in-6 concept a reality

Using AVC/H.264 and H.265 expertise to boost MPEG-2 efficiency and make the 6-in-6 concept a reality Using AVC/H.264 and H.265 expertise to boost MPEG-2 efficiency and make the 6-in-6 concept a reality A Technical Paper prepared for the Society of Cable Telecommunications Engineers By Anais Painchault

More information

Comparison of the Coding Efficiency of Video Coding Standards Including High Efficiency Video Coding (HEVC)

Comparison of the Coding Efficiency of Video Coding Standards Including High Efficiency Video Coding (HEVC) PRE-PUBLICATION DRAFT, TO APPEAR IN IEEE TRANS. ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, DEC. 2012 1 Comparison of the Coding Efficiency of Video Coding Standards Including High Efficiency Video Coding

More information

THE PRIMARY goal of most digital video coding standards

THE PRIMARY goal of most digital video coding standards IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 12, DECEMBER 2012 1669 Comparison of the Coding Efficiency of Video Coding Standards Including High Efficiency Video Coding

More information

Performance Verification of Super-Resolution Image Reconstruction

Performance Verification of Super-Resolution Image Reconstruction Performance Verification of Super-Resolution Image Reconstruction Masaki Sugie Department of Information Science, Kogakuin University Tokyo, Japan Email: em13010@ns.kogakuin.ac.jp Seiichi Gohshi Department

More information

Figure 1: Relation between codec, data containers and compression algorithms.

Figure 1: Relation between codec, data containers and compression algorithms. Video Compression Djordje Mitrovic University of Edinburgh This document deals with the issues of video compression. The algorithm, which is used by the MPEG standards, will be elucidated upon in order

More information

Copyright 2008 IEEE. Reprinted from IEEE Transactions on Multimedia 10, no. 8 (December 2008): 1671-1686.

Copyright 2008 IEEE. Reprinted from IEEE Transactions on Multimedia 10, no. 8 (December 2008): 1671-1686. Copyright 2008 IEEE. Reprinted from IEEE Transactions on Multimedia 10, no. 8 (December 2008): 1671-1686. This material is posted here with permission of the IEEE. Such permission of the IEEE does not

More information

Thor High Efficiency, Moderate Complexity Video Codec using only RF IPR

Thor High Efficiency, Moderate Complexity Video Codec using only RF IPR Thor High Efficiency, Moderate Complexity Video Codec using only RF IPR draft-fuldseth-netvc-thor-00 Arild Fuldseth, Gisle Bjontegaard (Cisco) IETF 93 Prague, CZ July 2015 1 Design principles Moderate

More information

Development and Application of a Distance Learning System by Using Virtual Reality Technology

Development and Application of a Distance Learning System by Using Virtual Reality Technology Systems and Computers in Japan, Vol. 33, No. 4, 2002 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J83-D-I, No. 6, June 2000, pp. 619 6 26 Development and Application of a Distance Learning

More information

302 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 2, FEBRUARY 2009

302 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 2, FEBRUARY 2009 302 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 2, FEBRUARY 2009 Transactions Letters Fast Inter-Mode Decision in an H.264/AVC Encoder Using Mode and Lagrangian Cost Correlation

More information

Terminal, Software Technologies

Terminal, Software Technologies What's Hot in R&D Terminal, Software Technologies Terminal technologies for ubiquitous services and software technologies related to solution businesses. Contents H-SW-1 H-SW-2 H-SW-3 H-SW-4 Professional

More information

Chapter 3 ATM and Multimedia Traffic

Chapter 3 ATM and Multimedia Traffic In the middle of the 1980, the telecommunications world started the design of a network technology that could act as a great unifier to support all digital services, including low-speed telephony and very

More information

Complexity-rate-distortion Evaluation of Video Encoding for Cloud Media Computing

Complexity-rate-distortion Evaluation of Video Encoding for Cloud Media Computing Complexity-rate-distortion Evaluation of Video Encoding for Cloud Media Computing Ming Yang, Jianfei Cai, Yonggang Wen and Chuan Heng Foh School of Computer Engineering, Nanyang Technological University,

More information

Introduction to image coding

Introduction to image coding Introduction to image coding Image coding aims at reducing amount of data required for image representation, storage or transmission. This is achieved by removing redundant data from an image, i.e. by

More information

Video Coding with Cubic Spline Interpolation and Adaptive Motion Model Selection

Video Coding with Cubic Spline Interpolation and Adaptive Motion Model Selection Video Coding with Cubic Spline Interpolation and Adaptive Motion Model Selection Haricharan Lakshman, Heiko Schwarz and Thomas Wiegand Image Processing Department Fraunhofer Institute for Telecommunications

More information

Video coding with H.264/AVC:

Video coding with H.264/AVC: Feature Video coding with H.264/AVC: Tools, Performance, and Complexity Jörn Ostermann, Jan Bormans, Peter List, Detlev Marpe, Matthias Narroschke, Fernando Pereira, Thomas Stockhammer, and Thomas Wedi

More information

Region of Interest Access with Three-Dimensional SBHP Algorithm CIPR Technical Report TR-2006-1

Region of Interest Access with Three-Dimensional SBHP Algorithm CIPR Technical Report TR-2006-1 Region of Interest Access with Three-Dimensional SBHP Algorithm CIPR Technical Report TR-2006-1 Ying Liu and William A. Pearlman January 2006 Center for Image Processing Research Rensselaer Polytechnic

More information

White paper. An explanation of video compression techniques.

White paper. An explanation of video compression techniques. White paper An explanation of video compression techniques. Table of contents 1. Introduction to compression techniques 4 2. Standardization organizations 4 3. Two basic standards: JPEG and MPEG 4 4. The

More information

SSIM Technique for Comparison of Images

SSIM Technique for Comparison of Images SSIM Technique for Comparison of Images Anil Wadhokar 1, Krupanshu Sakharikar 2, Sunil Wadhokar 3, Geeta Salunke 4 P.G. Student, Department of E&TC, GSMCOE Engineering College, Pune, Maharashtra, India

More information

Video Codec Requirements and Evaluation Methodology

Video Codec Requirements and Evaluation Methodology -47pt -30pt :white Font : edium t Video Codec Requirements and Evaluation Methodology www.huawei.com draft-filippov-netvc-requirements-02 Alexey Filippov, Jose Alvarez (Huawei Technologies) Contents An

More information

Projection Center Calibration for a Co-located Projector Camera System

Projection Center Calibration for a Co-located Projector Camera System Projection Center Calibration for a Co-located Camera System Toshiyuki Amano Department of Computer and Communication Science Faculty of Systems Engineering, Wakayama University Sakaedani 930, Wakayama,

More information

How To Test Video Quality On A Network With H.264 Sv (H264)

How To Test Video Quality On A Network With H.264 Sv (H264) IEEE TRANSACTIONS ON BROADCASTING, VOL. 59, NO. 2, JUNE 2013 223 Toward Deployable Methods for Assessment of Quality for Scalable IPTV Services Patrick McDonagh, Amit Pande, Member, IEEE, Liam Murphy,

More information

WHITE PAPER. H.264/AVC Encode Technology V0.8.0

WHITE PAPER. H.264/AVC Encode Technology V0.8.0 WHITE PAPER H.264/AVC Encode Technology V0.8.0 H.264/AVC Standard Overview H.264/AVC standard was published by the JVT group, which was co-founded by ITU-T VCEG and ISO/IEC MPEG, in 2003. By adopting new

More information

How To Test Video Quality With Real Time Monitor

How To Test Video Quality With Real Time Monitor White Paper Real Time Monitoring Explained Video Clarity, Inc. 1566 La Pradera Dr Campbell, CA 95008 www.videoclarity.com 408-379-6952 Version 1.0 A Video Clarity White Paper page 1 of 7 Real Time Monitor

More information

Motion Estimation. Macroblock Partitions. Sub-pixel Motion Estimation. Sub-pixel Motion Estimation

Motion Estimation. Macroblock Partitions. Sub-pixel Motion Estimation. Sub-pixel Motion Estimation Motion Estimation Motion Estimation and Intra Frame Prediction in H.264/AVC Encoder Rahul Vanam University of Washington H.264/AVC Encoder [2] 2 Motion Estimation H.264 does block based coding. Each frame

More information

Alberto Corrales-García, Rafael Rodríguez-Sánchez, José Luis Martínez, Gerardo Fernández-Escribano, José M. Claver and José Luis Sánchez

Alberto Corrales-García, Rafael Rodríguez-Sánchez, José Luis Martínez, Gerardo Fernández-Escribano, José M. Claver and José Luis Sánchez Alberto Corrales-García, Rafael Rodríguez-Sánchez, José Luis artínez, Gerardo Fernández-Escribano, José. Claver and José Luis Sánchez 1. Introduction 2. Technical Background 3. Proposed DVC to H.264/AVC

More information

White paper. HDTV (High Definition Television) and video surveillance

White paper. HDTV (High Definition Television) and video surveillance White paper HDTV (High Definition Television) and video surveillance Table of contents Introduction 3 1. HDTV impact on video surveillance market 3 2. Development of HDTV 3 3. How HDTV works 4 4. HDTV

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

For Articulation Purpose Only

For Articulation Purpose Only E305 Digital Audio and Video (4 Modular Credits) This document addresses the content related abilities, with reference to the module. Abilities of thinking, learning, problem solving, team work, communication,

More information

Template-based Eye and Mouth Detection for 3D Video Conferencing

Template-based Eye and Mouth Detection for 3D Video Conferencing Template-based Eye and Mouth Detection for 3D Video Conferencing Jürgen Rurainsky and Peter Eisert Fraunhofer Institute for Telecommunications - Heinrich-Hertz-Institute, Image Processing Department, Einsteinufer

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

Complexity-bounded Power Control in Video Transmission over a CDMA Wireless Network

Complexity-bounded Power Control in Video Transmission over a CDMA Wireless Network Complexity-bounded Power Control in Video Transmission over a CDMA Wireless Network Xiaoan Lu, David Goodman, Yao Wang, and Elza Erkip Electrical and Computer Engineering, Polytechnic University, Brooklyn,

More information

Multiple Description Coding (MDC) and Scalable Coding (SC) for Multimedia

Multiple Description Coding (MDC) and Scalable Coding (SC) for Multimedia Multiple Description Coding (MDC) and Scalable Coding (SC) for Multimedia Gürkan Gür PhD. Candidate e-mail: gurgurka@boun.edu.tr Dept. Of Computer Eng. Boğaziçi University Istanbul/TR ( Currenty@UNITN)

More information

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to: Chapter 3 Data Storage Objectives After studying this chapter, students should be able to: List five different data types used in a computer. Describe how integers are stored in a computer. Describe how

More information

Surface Reconstruction from Point Cloud of Human Body by Clustering

Surface Reconstruction from Point Cloud of Human Body by Clustering Systems and Computers in Japan, Vol. 37, No. 11, 2006 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J87-D-II, No. 2, February 2004, pp. 649 660 Surface Reconstruction from Point Cloud of Human

More information

The H.264/MPEG-4 Advanced Video Coding (AVC) Standard

The H.264/MPEG-4 Advanced Video Coding (AVC) Standard International Telecommunication Union The H.264/MPEG-4 Advanced Video Coding (AVC) Standard Gary J. Sullivan, Ph.D. ITU-T T VCEG Rapporteur Chair ISO/IEC MPEG Video Rapporteur Co-Chair Chair ITU/ISO/IEC

More information

A System for Capturing High Resolution Images

A System for Capturing High Resolution Images A System for Capturing High Resolution Images G.Voyatzis, G.Angelopoulos, A.Bors and I.Pitas Department of Informatics University of Thessaloniki BOX 451, 54006 Thessaloniki GREECE e-mail: pitas@zeus.csd.auth.gr

More information

Low-resolution Character Recognition by Video-based Super-resolution

Low-resolution Character Recognition by Video-based Super-resolution 2009 10th International Conference on Document Analysis and Recognition Low-resolution Character Recognition by Video-based Super-resolution Ataru Ohkura 1, Daisuke Deguchi 1, Tomokazu Takahashi 2, Ichiro

More information

A Method of Caption Detection in News Video

A Method of Caption Detection in News Video 3rd International Conference on Multimedia Technology(ICMT 3) A Method of Caption Detection in News Video He HUANG, Ping SHI Abstract. News video is one of the most important media for people to get information.

More information

Understanding Compression Technologies for HD and Megapixel Surveillance

Understanding Compression Technologies for HD and Megapixel Surveillance When the security industry began the transition from using VHS tapes to hard disks for video surveillance storage, the question of how to compress and store video became a top consideration for video surveillance

More information

MULTI-STREAM VOICE OVER IP USING PACKET PATH DIVERSITY

MULTI-STREAM VOICE OVER IP USING PACKET PATH DIVERSITY MULTI-STREAM VOICE OVER IP USING PACKET PATH DIVERSITY Yi J. Liang, Eckehard G. Steinbach, and Bernd Girod Information Systems Laboratory, Department of Electrical Engineering Stanford University, Stanford,

More information

Lossless Grey-scale Image Compression using Source Symbols Reduction and Huffman Coding

Lossless Grey-scale Image Compression using Source Symbols Reduction and Huffman Coding Lossless Grey-scale Image Compression using Source Symbols Reduction and Huffman Coding C. SARAVANAN cs@cc.nitdgp.ac.in Assistant Professor, Computer Centre, National Institute of Technology, Durgapur,WestBengal,

More information

Limitation of Super Resolution Image Reconstruction for Video

Limitation of Super Resolution Image Reconstruction for Video 2013 Fifth International Conference on Computational Intelligence, Communication Systems and Networks Limitation of Super Resolution Image Reconstruction for Video Seiichi Gohshi Kogakuin University Tokyo,

More information

Data Storage 3.1. Foundations of Computer Science Cengage Learning

Data Storage 3.1. Foundations of Computer Science Cengage Learning 3 Data Storage 3.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: List five different data types used in a computer. Describe how

More information

Quality improving techniques for free-viewpoint DIBR

Quality improving techniques for free-viewpoint DIBR Quality improving techniques for free-viewpoint DIBR Luat Do a, Sveta Zinger a and Peter H.N. de With a,b a Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, Netherlands b Cyclomedia

More information

4K End-to-End. Hugo GAGGIONI. Yasuhiko MIKAMI. Senior Manager Product Planning B2B Solution Business Group

4K End-to-End. Hugo GAGGIONI. Yasuhiko MIKAMI. Senior Manager Product Planning B2B Solution Business Group 4K End-to-End HPA Technology Retreat 2010 Yasuhiko MIKAMI Hugo GAGGIONI Senior Manager Product Planning B2B Solution Business Group Sony Corporation CTO Broadcast & Professional Division Sony Electronics

More information

Building an Advanced Invariant Real-Time Human Tracking System

Building an Advanced Invariant Real-Time Human Tracking System UDC 004.41 Building an Advanced Invariant Real-Time Human Tracking System Fayez Idris 1, Mazen Abu_Zaher 2, Rashad J. Rasras 3, and Ibrahiem M. M. El Emary 4 1 School of Informatics and Computing, German-Jordanian

More information

A Brief on Visual Acuity and the Impact on Bandwidth Requirements

A Brief on Visual Acuity and the Impact on Bandwidth Requirements Managing The Future Cable of Ultra TV Migration High Definition to IP TV Part (UHDTV) 1 A Brief on Visual Acuity and the Impact on Bandwidth Requirements Series Introduction: Market Drivers and Tech Challenges

More information

Philips HDR technology White paper

Philips HDR technology White paper Philips HDR technology White paper Philips International B.V. Version 1.0 2015-08-21 1. Introduction Current video formats are still based upon the display capabilities of CRT monitors. They cannot capture

More information

Multimedia Data Transmission over Wired/Wireless Networks

Multimedia Data Transmission over Wired/Wireless Networks Multimedia Data Transmission over Wired/Wireless Networks Bharat Bhargava Gang Ding, Xiaoxin Wu, Mohamed Hefeeda, Halima Ghafoor Purdue University Website: http://www.cs.purdue.edu/homes/bb E-mail: bb@cs.purdue.edu

More information

SYNTHESIZING FREE-VIEWPOINT IMAGES FROM MULTIPLE VIEW VIDEOS IN SOCCER STADIUM

SYNTHESIZING FREE-VIEWPOINT IMAGES FROM MULTIPLE VIEW VIDEOS IN SOCCER STADIUM SYNTHESIZING FREE-VIEWPOINT IMAGES FROM MULTIPLE VIEW VIDEOS IN SOCCER STADIUM Kunihiko Hayashi, Hideo Saito Department of Information and Computer Science, Keio University {hayashi,saito}@ozawa.ics.keio.ac.jp

More information

Quality Optimal Policy for H.264 Scalable Video Scheduling in Broadband Multimedia Wireless Networks

Quality Optimal Policy for H.264 Scalable Video Scheduling in Broadband Multimedia Wireless Networks Quality Optimal Policy for H.264 Scalable Video Scheduling in Broadband Multimedia Wireless Networks Vamseedhar R. Reddyvari Electrical Engineering Indian Institute of Technology Kanpur Email: vamsee@iitk.ac.in

More information

Gaming as a Service. Prof. Victor C.M. Leung. The University of British Columbia, Canada www.ece.ubc.ca/~vleung

Gaming as a Service. Prof. Victor C.M. Leung. The University of British Columbia, Canada www.ece.ubc.ca/~vleung Gaming as a Service Prof. Victor C.M. Leung The University of British Columbia, Canada www.ece.ubc.ca/~vleung International Conference on Computing, Networking and Communications 4 February, 2014 Outline

More information

H.264 Based Video Conferencing Solution

H.264 Based Video Conferencing Solution H.264 Based Video Conferencing Solution Overview and TMS320DM642 Digital Media Platform Implementation White Paper UB Video Inc. Suite 400, 1788 west 5 th Avenue Vancouver, British Columbia, Canada V6J

More information

Analyzing Facial Expressions for Virtual Conferencing

Analyzing Facial Expressions for Virtual Conferencing IEEE Computer Graphics & Applications, pp. 70-78, September 1998. Analyzing Facial Expressions for Virtual Conferencing Peter Eisert and Bernd Girod Telecommunications Laboratory, University of Erlangen,

More information

DYNAMIC DOMAIN CLASSIFICATION FOR FRACTAL IMAGE COMPRESSION

DYNAMIC DOMAIN CLASSIFICATION FOR FRACTAL IMAGE COMPRESSION DYNAMIC DOMAIN CLASSIFICATION FOR FRACTAL IMAGE COMPRESSION K. Revathy 1 & M. Jayamohan 2 Department of Computer Science, University of Kerala, Thiruvananthapuram, Kerala, India 1 revathysrp@gmail.com

More information

H 261. Video Compression 1: H 261 Multimedia Systems (Module 4 Lesson 2) H 261 Coding Basics. Sources: Summary:

H 261. Video Compression 1: H 261 Multimedia Systems (Module 4 Lesson 2) H 261 Coding Basics. Sources: Summary: Video Compression : 6 Multimedia Systems (Module Lesson ) Summary: 6 Coding Compress color motion video into a low-rate bit stream at following resolutions: QCIF (76 x ) CIF ( x 88) Inter and Intra Frame

More information

The Essence of Image and Video Compression 1E8: Introduction to Engineering Introduction to Image and Video Processing

The Essence of Image and Video Compression 1E8: Introduction to Engineering Introduction to Image and Video Processing The Essence of Image and Video Compression E8: Introduction to Engineering Introduction to Image and Video Processing Dr. Anil C. Kokaram, Electronic and Electrical Engineering Dept., Trinity College,

More information

Efficient Stream-Reassembling for Video Conferencing Applications using Tiles in HEVC

Efficient Stream-Reassembling for Video Conferencing Applications using Tiles in HEVC Efficient Stream-Reassembling for Video Conferencing Applications using Tiles in HEVC Christian Feldmann Institut für Nachrichtentechnik RWTH Aachen University Aachen, Germany feldmann@ient.rwth-aachen.de

More information