Design of Efficient Algorithms for Image Compression with Application to Medical Images
Ph.D. dissertation
Alexandre Krivoulets
IT University of Copenhagen
February 18, 2004
Abstract

This thesis covers different topics in the design of image compression algorithms. The main focus of this work is the development of efficient entropy coding algorithms, the development of optimization techniques for context modeling (with respect to the minimum code length), and the application of these methods in the design of an algorithm for compression of medical images.

Specifically, we study entropy coding methods based on a technique of binary decomposition of source symbols. We show that the binarization allows a parametric probability distribution model to be fitted to a source coding algorithm, e.g., arithmetic coding, reducing the number of coding parameters to that of the distribution model.

Context modeling is an essential part of an image coding algorithm and largely defines its compression performance. In the thesis, we describe a unified approach to this problem based on statistical learning methods and the minimum description length principle. In particular, we present a design of optimized models using context quantization and initialization techniques. The optimization finds a model that yields the minimum code length for a set of training data samples.

The entropy coding and context modeling methods are applied to develop a compression algorithm intended for medical images. The algorithm allows for progressive near-lossless coding and is based on a lossy-plus-refinement layered approach. We show that this method results in better compression performance and image quality at large distortion values compared with the recently adopted standard JPEG-LS for lossless and near-lossless image compression. We also investigate the possibility of image reconstruction under the minimum mean squared error criterion within the proposed framework.
Contents

1 Introduction
   Motivation and main goals
   Previous work
   About this thesis

2 Source coding and image compression
   Information sources
   Source coding
   Arithmetic coding
   Universal source coding
   Compression of images

3 Source coding via binary decomposition
   Introduction
   The binary decomposition technique
   On redundancy of binary decomposition
   Binary decomposition of FSM sources
   Binary decomposition and universal coding

4 Applications of binary decomposition to source coding
   Introduction
   Generalized two-sided geometric distribution
   Efficient coding of sources with the GTSGD
   Experimental results
   On redundancy of Rice coding
   Summary

5 Context modeling for image compression
   Introduction
   Context formation
   Context model optimization
   Context initialization
   Context quantization
   Summary

6 Optimization in the JPEG2000 standard
   Introduction
   Context modeling in JPEG2000
   High-order context modeling
   Experimental results
   Summary

7 Hierarchical modeling
   Introduction
   Two-stage quantization
   Tree-structured quantization
   Optimal tree pruning
   Experimental results
   Summary

8 Compression of medical images
   Introduction
   Embedded near-lossless quantization
   Entropy coding of the refinement layers
   Experimental results
   Reconstruction with minimum MSE criterion
   Summary

Bibliography

A Bit rates for test sets of medical images

B PSNR for test sets of medical images
Chapter 1

Introduction

1.1 Motivation and main goals

Computer imaging plays a significant role in many areas, ranging from consumer digital photo albums to remote earth sensing. The growing production of images and the demands on their quality require high-performance compression methods for efficient transmission, storage and archival.

Most image compression algorithms can be viewed as consisting of an image transformation, which decomposes an image into a sequence of descriptors, followed by entropy coding of the descriptors. Entropy coding, in essence, performs the compression. Prediction, discrete cosine and wavelet transforms are examples of image decompositions, whereas prediction errors and transform coefficients are examples of the descriptors. They constitute an information source for the entropy coder, which encodes the sequence of source symbols according to some source model. This model is normally designed off-line using some assumptions on the data to be coded. The better the model approximates the statistical properties of the data, the higher the compression that can be achieved. The lower bound on the compression performance is defined by the entropy of the source. The design of efficient models is therefore a task of primary interest for any compression algorithm.

Most image compression algorithms use the universal coding approach, where the model has a fixed structure and unknown parameters. The parameters are estimated on the fly during encoding. This approach allows the model to adapt to the data statistics, which may vary for different data. On the other hand, the price of this adaptivity is an increase in code length due to the implicit or explicit transmission of information on the model parameters (the so-called model cost). The higher the model order, the more parameters it involves, and the higher the model cost. Thus, there is a trade-off between the order of the model and the overhead data needed to specify the model parameters. The use of domain knowledge of the data allows this overhead information to be reduced. Finding optimal solutions to this problem is one of the goals of this thesis. In this connection, we investigate the technique of source coding via binary decomposition of symbols. The decomposition can be used as a means for model cost
reduction and/or for efficient model optimization.

Another goal is the design of an efficient algorithm for compression of medical images. The design of such an algorithm differs from the design of general-purpose image compression methods due to some specific requirements. The main concern in medical imaging is the quality of the reconstructed image. There are three kinds of image compression methods: lossless, lossy and near-lossless. Lossless compression methods reconstruct an image perfectly, but they achieve the lowest degree of compression. Lossy compression techniques are usually used for images intended for human observation in general-purpose computer systems. Such methods are based on the fact that the human visual system does not perceive the high-frequency spatial components of the image, so those components may be removed without any visible degradation. Thus, by introducing distortion into the reconstructed data, one can achieve higher compression while preserving the visual information content. The main advantage of lossy methods is that they achieve the highest degree of compression. However, distortion of the data is undesirable in some applications, e.g., image analysis, object recognition and others. Near-lossless compression is a lossy coding approach in which the distortion is specified by the maximum allowable absolute difference (the tolerance value) between pixel values of the original and the reconstructed images. The method allows for rigid and robust control of errors while achieving reasonable compression performance. For that reason, near-lossless compression is an attractive method for medical images. In our work, we develop an efficient algorithm which allows for progressive near-lossless compression up to the lossless mode. The extended functionality makes the algorithm more suitable for real applications. Compression of medical images also allows the similarity of the statistical properties of the data to be exploited to design high-order source models and thus to achieve better compression performance.

1.2 Previous work

Image coding originates from the communication theory established in 1948 by Shannon in his famous paper [50]. Yet, it took years before the methods developed in this work could be used in practical algorithms for image compression. In recent decades, image compression has been extensively studied by many researchers. The first lossy and lossless still image coding standard, JPEG [33], appeared by the end of the 1980s and had a remarkable performance at that time. The standard uses a block discrete cosine transform (DCT). The DCT was shown to be very close to the optimal decomposition for the class of natural images, performing almost perfect decorrelation of image pixels, see, e.g., [35]. The standard uses an ad-hoc Huffman or arithmetic entropy coding of the transform coefficients. The version with arithmetic coding exploits binarization of the coefficients and simple heuristic context modeling. The JPEG standard remains the most popular image compression algorithm.

Introduction of the (discrete) wavelet transform (DWT) in the middle of the 1980s [20,
30, 26] launched a new era in image compression. Besides being good for decorrelation of image pixels, the new transform provides more functionality to the compression algorithm, such as embedded coding, where any initial part of the compressed bit stream can be used to reconstruct the image with a quality in proportion to the length of this part. A benchmark method called embedded zerotree wavelet (EZW) coding was introduced in [51] and further developed in the SPIHT (Set Partitioning in Hierarchical Trees) algorithm [47]. The use of an integer-to-integer (reversible) wavelet transform [9] allowed for progressive coding up to lossless image reconstruction [8, 48, 3, 64]. Development of context modeling techniques and rate-distortion optimization led to a remarkable improvement of the compression efficiency of algorithms based on the DWT. The most sophisticated methods are the high-order embedded entropy coder of wavelet coefficients (ECECOW) [64], embedded block coding with optimized truncation (EBCOT) [55], and the compression with reversible embedded wavelets (CREW) algorithm [69], just to name a few. The EBCOT algorithm was adopted as the basis for the new standard for still image coding, JPEG2000 [56].

The best compression results for lossless image compression were obtained by algorithms based on predictive coding and context modeling of prediction residuals. For example, the algorithm CALIC (Context-based, Adaptive, Lossless Image Codec) [68] uses sophisticated context modeling and a special kind of prediction (a non-linear, gradient-adaptive predictor) to achieve the best compression performance among practical coders. The performance of CALIC turns out to be even better than that of the UCM (Universal Context Modeling) method [57], developed on the basis of a universal source coding algorithm with provable asymptotic optimality. The ideas from CALIC were employed in the LOCO-I (LOw COmplexity LOssless COmpression for Images) algorithm [59], which became the new standard for lossless image coding, JPEG-LS [60]. The standard also supports near-lossless compression.

Near-lossless compression first appeared in [10] as a method allowing for much better compression performance compared to lossless coding at the price of a very small and controllable distortion of the image. It was further elaborated in [65] to achieve higher compression efficiency. The first method allowing progressive coding was proposed in [5]. The major improvements in compression performance are due to the development of new techniques for context modeling, such as context quantization [67], and the use of parametrized probability distribution models (LOCO-I).

1.3 About this thesis

The thesis covers two major problems in image compression: entropy coding and context modeling. For efficient entropy coding, we studied the use of a binary decomposition technique with application to coding of sources with parametrized distributions. For the design of high-order context models, we developed optimization methods based on training data samples. The proposed solutions are tested in a series of experiments and applied in the design
of an algorithm for near-lossless compression primarily intended for coding medical images.

The thesis is organized such that most chapters are self-contained in the sense that they cover different topics, yet within the same framework. The material of the thesis is based on the following papers:

1. A. Krivoulets, On redundancy of coding using binary tree decomposition, in Proc. IEEE Int. Workshop on Inf. Theory, p. 200, Oct., Bangalore, India. (Chapter 3)

2. V.F. Babkin, A.G. Krivoulets, On coding of sources with Laplacian distribution, in Proc. of Popov's Society Conference, p. 254, May, 2000, Moscow, Russia (in Russian). (Chapter 4)

3. A. Krivoulets, On coding of sources with two sided geometric distribution using binary decomposition, in Proc. Data Compression Conf., p. 459, Snowbird, UT, Apr. (Chapter 4)

4. A. Krivoulets, Efficient entropy coding for image compression, IT University of Copenhagen, Tech. report TR, February. (Chapter 4)

5. A. Krivoulets, Fast and efficient coding of low entropy sources with two-sided geometric distribution, in Proc. of the 2nd European Workshop on Advanced Video-Based Surveillance Systems, Kingston, UK, Sep. (Chapter 4)

6. A. Krivoulets, On redundancy of Rice coding, IT University of Copenhagen, Tech. report TR, September. (Chapter 4)

7. A. Krivoulets, X. Wu, and S. Forchhammer, On optimality of context modeling for bit-plane entropy coding in the JPEG2000 standard, in Proc. VLBV03 Workshop, Madrid, Spain, Sep., 2003 (LNCS 2849). (Chapters 5, 6)

8. A. Krivoulets and X. Wu, Hierarchical modeling via optimal context quantization, in Proc. 12th International Conference on Image Analysis and Processing, Mantova, Italy, Sep. (Chapter 7)

9. A. Krivoulets, A method for progressive near-lossless image compression, in Proc. ICIP2003, vol. 2, Barcelona, Spain, Sep. (Chapter 8)

10. A. Krivoulets, Progressive near-lossless coding of medical images, in Proc. 3rd International Symposium on Image and Signal Processing and Analysis, vol. 1, Rome, Italy, Sep. (Chapter 8)

A brief description of the chapters is as follows. Chapter 2 describes the main concepts of image compression algorithms and introduces the framework of our research. In this chapter, we describe basic principles and the place of entropy coding and context modeling in image compression algorithms.
In Chapter 3, we discuss a technique of binary decomposition of source symbols as an efficient means for entropy coding and present some properties of the technique.

Some applications of the technique are presented and discussed in Chapter 4. In this chapter, we introduce a probabilistic source model that often occurs in image compression and describe efficient methods for coding such sources using the binarization technique. Binarization allows the number of coding parameters, which is usually equal to the size of the source alphabet, to be reduced to the number of probability distribution parameters, which is normally much lower. Binarization also simplifies the context model optimization described in the next chapter.

Chapter 5 presents basic principles of high-order context modeling for image compression. The chapter covers two main topics of context model design: context formation and context model optimization. The latter is based on context initialization and quantization using some prior statistics on the data.

In Chapter 6, we describe an application of the high-order context modeling techniques developed in Chapter 5 to the optimization of the context models adopted in the JPEG2000 standard. The models are used in the bit-plane entropy coding of wavelet transform coefficients and largely define the compression performance of the standard. In this chapter we demonstrate the near-optimality of the models adopted in JPEG2000 for the given context template.

Chapter 7 extends the ideas of the context quantization technique to build a hierarchical set of models intended for a better fit to the actual data.

Finally, in Chapter 8, we present an algorithm for near-lossless compression intended for medical images, which allows for progressive coding and reconstruction. In the algorithm design, we exploited the methods developed in Chapters 3, 4, and 5. We show that the resulting algorithm allows for more efficient coding (in terms of both compression performance and functionality) than the recently adopted standard JPEG-LS for lossless and near-lossless image compression.
Chapter 2

Source coding and image compression

In this chapter we introduce the basic theory, concepts and definitions of source coding and consider the general structure of image compression algorithms. This background will be used throughout the thesis.

2.1 Information sources

By an information source we mean a mechanism generating discrete random variables $u$ from a countable (often finite, but possibly infinite) set $A = \{a_0, a_1, \ldots, a_{m-1}\}$. The set $A$ is called the source alphabet and $m = |A|$ defines the alphabet size (hereafter, $|\cdot|$ denotes the cardinality of a set or the length of a string). Let $u_t$ be the random variable generated at time instance $t$. A string of source symbols of length $n$ (a source message) is denoted $u_1^n = u_1 u_2 u_3 \ldots u_n$; $u_1^0$ denotes the empty string. Let $A^n$ be the set of all possible messages of length $n$ in the alphabet $A$: $A^n = \{u_1^n\}$.

The simplest source model generates independent and identically distributed (i.i.d.) symbols according to the probability distribution $P(A) = \{p(a), a \in A\}$, which defines the model parameters (the probabilities of the source symbols). Such a model is called a memoryless source. The distribution $P(A)$ does not depend on the past symbols.

A general source model is the finite state machine (FSM) model, which is defined by a (finite) set of states $S = \{s\}$, the source alphabet $A$, the set of conditional probability distributions $\{P_s(A), s \in S\}$, $P_s(A) = \{p(a \mid s), a \in A\}$, and the initial state $s_1$. At each time instance $t$ the source generates a symbol $u_t \in A$ and changes its state from $s_t$ to $s_{t+1}$ according to the state transition rule

    $s_{t+1} = F(s_t, u_1^t)$.    (2.1)

The function $F(\cdot)$ will be referred to as the model structure. The set of probability distributions $\{P_s(A), s \in S\}$ specifies the set of model parameters. A message $u_1^n$ generated by an FSM source can be decomposed into $|S|$ subsequences generated at the states $s \in S$, each of which is an i.i.d. sequence drawn according to the distribution $P_s(A)$.
If the state $s_{t+1}$ is uniquely defined by the previous state $s_t$ and the symbol $u_t$ generated at this state, i.e.,

    $s_{t+1} = F(s_t, u_t)$,    (2.2)

then the model is called a Markov source. The structure function $F(\cdot)$ of a Markov source can be described by a directed graph of state transitions, where nodes define the states and edges correspond to the source symbols and define the state transitions. Let $P_{ij}$ be the probability of entering the state $s = j$ from the state $s = i$:

    $P_{ij} = \Pr(s_{t+1} = j \mid s_t = i)$.    (2.3)

If the state transition rule $F(\cdot)$ is such that the probability (2.3) depends only on the previous state, i.e., $\Pr(s_{t+1} \mid s_t, s_{t-1}, \ldots) = \Pr(s_{t+1} \mid s_t)$, then the sequence of states forms a homogeneous Markov chain. If the state is uniquely defined by the $o$ last source symbols $u_{t-o+1}^t$, then the source is called an $o$-order Markov chain. For the $o$-order Markov chain model we have

    $p(a \mid s) = p(a \mid u_{t-o+1}^t)$

and

    $s_{t+1} = F(u_{t-o+1}^t)$,

where

    $F(u_{t-o+1}^t) : u_{t-o+1}^t \mapsto s \in \{1, 2, \ldots, |A|^o\}$.

The FSM model is a good approximation to most practical sources, and therefore it is widely used in compression algorithms.

The main property of an information source is its entropy, defined as

    $H = \lim_{n \to \infty} \frac{1}{n} H(A^n)$,    (2.4)

where

    $H(A^n) = -\sum_{u_1^n \in A^n} p(u_1^n) \log_2 p(u_1^n)$,    (2.5)

and $p(u_1^n)$ is the probability of the string $u_1^n$.

2.2 Source coding

Let $B = \{0, 1\}$ be the binary alphabet and $B^*$ the set of all words over $B$. Source coding is a mapping which assigns to each source message $u_1^n \in A^n$ a codeword $\varphi(u_1^n) \in B^*$ such that $u_1^n$ can be uniquely reconstructed. A set of codewords $\{\varphi(u_1^n), u_1^n \in A^n\}$ is called a prefix code if no codeword is a prefix of any other codeword in the set. The use of a prefix code guarantees unique decipherability.
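As a small aside (this snippet is ours, not from the thesis; names are illustrative), the state mapping of an $o$-order Markov chain is in practice often realized as a context index computed from the last $o$ symbols, which is how conditional statistics are addressed in an implementation. A minimal sketch in C, using 0-based states:

    /* Context index of an o-order Markov chain: maps the last o symbols
     * u[t-o+1],...,u[t] (each in {0,...,m-1}) to a state in {0,...,m^o - 1}. */
    static unsigned context_index(const unsigned *u, int t, int o, unsigned m)
    {
        unsigned s = 0;
        for (int k = t - o + 1; k <= t; k++)   /* assumes t >= o - 1 */
            s = s * m + u[k];                  /* base-m number formed by the o symbols */
        return s;
    }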
The codeword lengths $|\varphi(\cdot)|$ of any prefix code must satisfy the Kraft inequality:

    $\sum_{u_1^n \in A^n} 2^{-|\varphi(u_1^n)|} \le 1$.    (2.6)

Conversely, if a set of codeword lengths satisfies (2.6), then there exists a prefix code with these codeword lengths. The set $Q(A^n) = \{q(u_1^n) = 2^{-|\varphi(u_1^n)|}, u_1^n \in A^n\}$ is called the coding probability distribution on $A^n$. Since the codeword lengths are of main concern, this distribution is often useful in source coding analysis.

Given the source model, the key challenges in source coding are the choice of the codeword lengths and the construction of the codewords. Normally the codewords are chosen to minimize the description length $|\varphi(u_1^n)|$, i.e., to compress the data. Thus, the terms source coding and data compression are often interchanged in the literature, even though source coding has a broader meaning. In what follows, we adopt this tradition and use both terms in the same sense.

Let

    $L = \sum_{u_1^n \in A^n} p(u_1^n)\, |\varphi(u_1^n)|$    (2.7)

be the average code length for the set $A^n$. The source coding theorem ([16], Theorem 3.3.1) establishes the lower bound on $L$ for lossless coding,

    $L \ge H(A^n)$,    (2.8)

with equality iff $|\varphi(u_1^n)| = -\log_2 p(u_1^n)$. The theorem also states that there exists a prefix code such that $L < H(A^n) + 1$. Thus, the minimum possible code length for the message $u_1^n$ is defined by

    $|\varphi(u_1^n)| = -\log_2 p(u_1^n)$.    (2.9)

This quantity is called the self-information of the message $u_1^n$. This is the basic idea of source coding: data compression is possible by assigning shorter codewords to more probable messages (symbols) and longer codewords to less probable ones. Maximum compression is achieved by choosing the codeword lengths equal to minus the logarithm of the probability of a message.

The main property of a code is its redundancy. There are different redundancy measures defined in source coding. The two measures which will be used later in the thesis are the average redundancy and the individual redundancy,

    $R_a = L - H(A^n)$,    (2.10)
    $R_i = |\varphi(u_1^n)| + \log_2 p(u_1^n)$,    (2.11)

respectively. The task of codeword construction can be solved using arithmetic coding.
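As a small numerical illustration of (2.6) and (2.9) (the snippet is ours, not part of the thesis), the sketch below checks the Kraft inequality for a set of codeword lengths and evaluates the self-information of a message with a given probability:

    #include <math.h>
    #include <stdio.h>

    /* Check the Kraft inequality (2.6) for a set of codeword lengths. */
    static int kraft_holds(const int *len, int count)
    {
        double sum = 0.0;
        for (int i = 0; i < count; i++)
            sum += pow(2.0, -(double)len[i]);
        return sum <= 1.0 + 1e-12;           /* small tolerance for rounding */
    }

    int main(void)
    {
        int len[] = { 1, 2, 3, 3 };          /* e.g. codewords 0, 10, 110, 111 */
        printf("Kraft inequality holds: %d\n", kraft_holds(len, 4));
        double p = 0.125;                    /* probability of a message       */
        printf("self-information: %.2f bits\n", -log2(p));   /* cf. (2.9)     */
        return 0;
    }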
2.3 Arithmetic coding

Arithmetic coding is a method for sequential calculation of the codeword $\varphi(u_1^n)$ for the source message $u_1^n$. It is based on the unpublished Elias algorithm described by Abramson [2] and Jelinek [22]. The first practical implementations are due to Rissanen [43] and Pasco [32], who solved the finite-precision problem, and to Witten et al. [61], who made the method popular by publishing the C code of their implementation.

In arithmetic coding, the codeword is recursively calculated as a cumulative coding probability of the string $u_1^n$:

    $\varphi(u_1^n) = \sum_{a < u_1} q(a) + q(u_1) \sum_{a < u_2} q(a \mid u_1) + \cdots + q(u_1^{n-1}) \sum_{a < u_n} q(a \mid u_1^{n-1})$,    (2.12)

where $\{q_t(a \mid u_1^t), a \in A, t = 0, 1, \ldots, n-1\}$ is a sequence of conditional coding probability distributions satisfying

    $q_t(a \mid u_1^t) = \frac{q(u_1^t a)}{q(u_1^t)}$,    (2.13)

such that

    $\sum_{a \in A} q_t(a \mid u_1^t) \le 1$,    (2.14)

    $q(u_1^n) = \prod_{t=0}^{n-1} q(u_{t+1} \mid u_1^t)$.    (2.15)

The calculations are assumed to have infinite precision, resulting in ideal arithmetic coding. The main property of ideal arithmetic coding is the following theorem, which is a slightly modified version of Theorem 1 from [52].

Theorem 1. Given a sequence of coding distributions $\{q_t(a \mid u_1^t), a \in A, t = 1, 2, \ldots, n\}$ satisfying (2.13), (2.14), an arithmetic coder achieves codeword lengths

    $|\varphi(u_1^n)| < -\log_2 q(u_1^n) + 2$,  $u_1^n \in A^n$.    (2.16)

The codewords form a prefix code.

In practice, calculations are performed with finite precision. In this case, upper bounds on the coding redundancy are given by the following theorem.

Theorem 2. Let $r$ bits be used in the binary representation of the coding probabilities $q(a \mid \cdot)$, $a \in A$, and let registers of $g \ge r + 2$ bits be used for the calculations. Then

    $R_a \le mn(r + \log_2 e)\, 2^{-(g-2)} + 2$,    (2.17)
    $R_i \le g + (n-1)\, 2^{r-g}$.    (2.18)
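To make the recursion behind (2.12) and (2.15) concrete, here is a minimal floating-point sketch (ours, purely illustrative; it is not the finite-precision coder of Theorem 2). It narrows an interval symbol by symbol for an i.i.d. coding distribution: the lower end accumulates the cumulative probability of (2.12) and the width becomes $q(u_1^n)$ of (2.15).

    #include <math.h>
    #include <stdio.h>

    /* Elias/arithmetic-coding recursion for a memoryless coding distribution p[]
     * over the alphabet {0,...,m-1}.  Returns the ideal code length -log2 q(u_1^n);
     * 'low' ends up equal to the cumulative coding probability of (2.12). */
    static double elias_recursion(const int *u, int n, const double *p)
    {
        double low = 0.0, width = 1.0;
        for (int t = 0; t < n; t++) {
            double cum = 0.0;                /* sum of q(a) over a < u_t             */
            for (int a = 0; a < u[t]; a++)
                cum += p[a];
            low   += width * cum;            /* cumulative probability, cf. (2.12)   */
            width *= p[u[t]];                /* q(u_1^{t+1}) = q(u_1^t) q(u_{t+1}), (2.15) */
        }
        (void)low;                           /* a real coder would emit the bits of 'low' */
        return -log2(width);
    }

    int main(void)
    {
        double p[] = { 0.7, 0.2, 0.1 };
        int    u[] = { 0, 0, 1, 0, 2, 0 };
        printf("ideal code length: %.3f bits\n", elias_recursion(u, 6, p));
        return 0;
    }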
[Figure 2.1: A block diagram of source coding with an arithmetic coder: in both the encoder and the decoder, a source model supplies the coding probabilities $q(u_{t+1} \mid u_1^t)$ to the arithmetic coder.]

This theorem is essentially a compilation of Theorem 4 from [46], which establishes the bound for $R_a$, and the estimate on $R_i$ given in [41]. It is clear from the theorem that, even with a finite-precision arithmetic implementation, the coding redundancy is usually negligible.

Using arithmetic coding, we can separate the problem of assigning the coding probabilities $q(u_{t+1} \mid u_1^t)$ according to some chosen criterion (different criteria will be considered in the next section) from the codeword construction. Source coding using an arithmetic coder is schematically represented in Figure 2.1 [1], where the source model sequentially assigns the coding probabilities $q(u_{t+1} \mid u_1^t)$ and the arithmetic coder sequentially computes the corresponding codeword.

2.4 Universal source coding

If the source model is known, then the choice of the coding distribution $q(a \mid \cdot) = p(a \mid \cdot)$, $a \in A$, allows a codeword for the message $u_1^n$ to be calculated with arithmetic coding that exceeds the ideal length by at most 2 bits (disregarding the arithmetic precision problem). In practice, however, the source parameters (or even the underlying source model) are usually not known in advance or may vary for different data.

In universal coding, it is assumed that the source belongs to some predefined set of models $\Omega = \{\omega\}$. A code is designed to perform well for all, or most of, the models in the set. The set can be just a parametric probabilistic set of sources having the same model structure (e.g., Markov sources with the same state transition rule) or may be a double mixture of sources with different structures and sets of parameters [11].

Let $\varphi_\Omega(A^n) \subset B^*$ be a prefix code on $A^n$ used for all sources in the set $\Omega$ and $R\{\varphi_\Omega(A^n)\}$ be a measure of performance (redundancy) of the code such that

    $R\{\varphi_\Omega(A^n)\} \ge 0$,    (2.19)

with equality when the set $\Omega$ contains only a single element, i.e., when the source is
known (the source structure and the parameters are given). Thus, the redundancy of universal coding is due only to the lack of knowledge about the source.

Let

    $\hat\varphi_\Omega = \arg\inf_{\varphi} R\{\varphi_\Omega(A^n)\}$    (2.20)

and

    $\hat R = R\{\hat\varphi_\Omega(A^n)\}$.    (2.21)

The code $\hat\varphi_\Omega$ is said to be universal with respect to the model set $\Omega$ if

    $\hat R \to 0$    (2.22)

as $n \to \infty$ [11].

There are two main measures used in universal coding, based on the average and individual redundancy. Let

    $R_a^n(\varphi_\Omega, \omega) = \sum_{u_1^n \in A^n} p(u_1^n \mid \omega)\, |\varphi_\Omega(u_1^n)| - H(A^n \mid \omega)$    (2.23)

and

    $R_i^n(\varphi_\Omega, \omega) = |\varphi_\Omega(u_1^n)| + \log_2 p(u_1^n \mid \omega)$    (2.24)

be the average and individual redundancy of the code $\varphi_\Omega$ for the source $\omega$, respectively. Then the measures of universal coding are defined by the maximum average and maximum individual per-symbol redundancy

    $R_a\{\varphi_\Omega(A^n)\} = \sup_{\omega \in \Omega} \frac{1}{n} R_a^n(\varphi_\Omega, \omega)$,    (2.25)

    $R_i\{\varphi_\Omega(A^n)\} = \sup_{\omega \in \Omega} \frac{1}{n} R_i^n(\varphi_\Omega, \omega)$.    (2.26)

In general, the use of different criteria results in different codes. The criterion (2.26) was first proposed in [53, 54]. Clearly, it is stronger than (2.25), and a code $\varphi_\Omega$ with good properties according to the criterion (2.26) also behaves well w.r.t. the criterion (2.25).

The lower bound on the convergence (2.22) w.r.t. both criteria for the parametric sets of memoryless and FSM sources is given by [44, 52]

    $\hat R \ge \frac{K}{2n} \log_2 n + O\!\left(\frac{1}{n}\right)$,    (2.27)

where $K$ is the number of free parameters. For memoryless sources $K = m - 1$, and for FSM sources with $|S|$ states $K = |S|(m-1)$ ($m$ is the alphabet size).

Coding of an FSM source using the sequential coding probability distribution

    $q_{t+1}(a \mid s) = \frac{\vartheta(a \mid u_1^{t_s}(s)) + \alpha_a(s)}{t_s + \sum_{a \in A} \alpha_a(s)}$,    (2.28)
combined with arithmetic coding, where $\vartheta(a \mid u_1^{t_s}(s))$ is the number of occurrences of the symbol $a$ in the subsequence $u_1^{t_s}(s)$ corresponding to the state $s$, and $t_s$ is the length of that subsequence, achieves the optimal convergence rate (2.27). The values $\{\alpha_a(s) > 0, a \in A, s \in S\}$ define prior distributions on the source parameters $\{P_s(A), s \in S\}$. If nothing is known about the $P_s(A)$, then the best choice is $\alpha_a(s) = \frac{1}{2}$, $a \in A$, $s \in S$; see, e.g., [23, 54]. For memoryless sources, (2.28) reduces to

    $q_{t+1}(a) = \frac{\vartheta(a \mid u_1^t) + \alpha_a}{t + \sum_{a \in A} \alpha_a}$.    (2.29)

2.5 Compression of images

A gray-scale digital image¹ is a 2-dimensional array of bounded integer values (image pixels) $v[y, x]$, $1 \le x \le X < \infty$, $1 \le y \le Y < \infty$, where $y$ and $x$ define the row and column coordinates, respectively. The values $X$ and $Y$ define the size of the image. The range of pixel values is normally defined in terms of the number of bits $k$ required to represent all image pixels. Thus, for a $k$-bit image, the pixel value² lies in the range $v[y, x] \in [0, 2^k - 1]$. Most general-purpose images use an 8-bit representation; medical images often use more bits per pixel. An example of an image is shown in Figure 2.2, where the small squares represent pixels and the grey scale reflects the pixel value. In its original representation, an image takes $X \cdot Y \cdot k$ bits. Image compression aims at reducing this number by using source coding techniques.

¹ In the thesis, we deal only with 2-D images.
² Without loss of generality, we assume that pixels can take only non-negative values.

An image compression system can be described in terms of four functional blocks: image transformation, quantization, encoder model (not to be confused with the source model introduced in Section 2.2), and entropy coder, as shown in Figure 2.3; the only mandatory part is the entropy coder. (In general, the image pixels could be fed directly to the entropy coder; however, in most applications this approach would lead to inferior compression performance due to the significant spatial correlation of the image pixels.)

The image transform part converts an image into a sequence of descriptors, which form another (abstract) representation of the image. The most widely used transformation techniques are prediction and discrete orthogonal transforms. The transformations exploit the fact that image pixels typically have substantial correlation with their neighbors. Predictive coding exploits an auto-regressive model, where a weighted combination of past image pixels is used to predict the value of the next pixel. The differences (residuals) between the predicted and the original pixel values define the sequence of descriptors. The idea of an orthogonal transformation is that the image is represented as a linear combination of some basis functions (waveforms); the transform coefficients are the weighting factors of these functions. If the transform coefficients are real-valued, then quantization of the coefficients is a necessary step of image coding.
[Figure 2.2: Example of a gray-scale 8-bit image.]

The zero-order entropy of prediction residuals and of (quantized) transform coefficients is much lower than that of the image pixels. Furthermore, using an orthogonal transformation, an image can be represented by a few transform coefficients. All this significantly reduces the amount of data to be stored or transmitted.

The most widely used orthogonal transformations for compression are the discrete cosine and wavelet transforms (DCT and DWT, respectively) [36]. The basis functions of the DCT are asymptotically optimal for decorrelation of a first-order one-dimensional stationary Markov process. Nevertheless, the 2-D separable transformation still has a good decorrelation property and allows for a high compression capability (retaining most of the information in a few transform coefficients). There exist fast algorithms for its computation. The DCT is used in many image and video compression algorithms; examples are the still image coding standard JPEG (Joint Photographic Experts Group) [33] and the video coding standard MPEG (Moving Picture Experts Group) [25, 49].

The wavelet transform has superior energy compaction due to its comparatively short basis functions, and it is better suited to non-stationary signals such as real images. Yet, the main advantage of the DWT is its inherent ability to provide a multiresolution representation of the signal, which adds useful functionality to a compression algorithm. A further benefit of the DWT is the existence of reversible integer-to-integer transformation algorithms [9]. Using embedded quantization of the transform coefficients,
such a transformation allows for efficient progressive representation in quality and spatial resolution, up to perfect reconstruction of the original image. The DWT is exploited in the new standard for compression of still images, JPEG2000 [56]. Objective and subjective tests show that the DWT achieves higher compression performance than the DCT; however, the DWT is normally more time and memory consuming.

[Figure 2.3: An image compression system: image transform, quantization, encoder model, and entropy coder.]

The transformation produces almost uncorrelated data and makes the subsequent quantization step more efficient, ultimately resulting in better compression. Quantization of the descriptors (prediction errors, transform coefficients) introduces distortion into the reconstructed image and results in lossy compression. Lossy compression allows for higher compression rates than lossless coding. Using an appropriate quantization rule, the quality of the reconstructed image and the compression rate can be efficiently managed and controlled. The theoretical principles of this trade-off are given by rate-distortion theory, see, e.g., [7, 16]. Quantization of the DCT or DWT coefficients yields an approximation of the original image in the $L_2$ sense. Quantization of prediction errors results in the so-called near-lossless compression, where the distortion is specified by the maximum absolute difference between the original and the reconstructed image pixels. In the literature it is also called $L_\infty$-constrained lossy coding [65].

The encoder model maps the descriptors into source symbols for the entropy coder. This may be done explicitly or implicitly. An example of explicit mapping is the (old) JPEG standard: in JPEG, the sequences of quantized high-frequency transform coefficients are converted into blocks, which are then coded using a Huffman code. Bit-plane coding of wavelet transform coefficients in the JPEG2000 standard is an example of implicit mapping, where the sequence of binary symbols is the information source for the entropy coder. In some coders the descriptors themselves (transform coefficients or prediction residuals) constitute the input symbols for the entropy coder.
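As a concrete illustration of the predictive decomposition described above (the snippet is ours and uses the median edge detector of LOCO-I/JPEG-LS as an example predictor; the thesis does not prescribe this particular choice), the residual descriptor for one pixel can be computed from its causal neighbours as follows:

    /* Median edge detector (MED) predictor used in LOCO-I/JPEG-LS:
     * a = left neighbour, b = upper neighbour, c = upper-left neighbour. */
    static int med_predict(int a, int b, int c)
    {
        int mn = a < b ? a : b;
        int mx = a < b ? b : a;
        if (c >= mx) return mn;      /* horizontal or vertical edge */
        if (c <= mn) return mx;
        return a + b - c;            /* smooth region: planar prediction */
    }

    /* The residual fed to the entropy coder as a descriptor. */
    static int residual(int pixel, int a, int b, int c)
    {
        return pixel - med_predict(a, b, c);
    }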
The transformation alone does not perform data compression. On the contrary, it often expands the data, since the transform coefficients, even when quantized, may take more bits to represent than the pixels themselves. The encoder model merely translates the descriptors into a representation more convenient for the entropy coder; it does not change the amount of description information. The compression is performed solely by the entropy coder. That is why developing an efficient entropy coder is an important step in the design of an image compression algorithm.
Chapter 3

Source coding via binary decomposition

In this chapter, we study source coding using binary decomposition of source symbols and derive some of its properties.

3.1 Introduction

In Chapter 2 we introduced a general structure of image compression algorithms consisting of an image transformation followed by entropy coding of the description symbols (descriptors). The sequence of descriptors constitutes an information source for the entropy coder. In general, the definition of the information source in a compression algorithm (the encoder model) is left to the algorithm designer. It can be the sequence of descriptors itself. The source symbols can also be represented by blocks of descriptors, as implemented in the JPEG standard for coding the AC coefficients of the DCT [33]. The grouping of symbols is called alphabet extension [42]. It is commonly used in combination with Huffman coding. However, it was shown in [42] that any source can be coded without alphabet extension using an arithmetic coder.

An information source can also be reduced to a binary source via binary decomposition of the source symbols. Binary decomposition combined with binary arithmetic coding is a well known method for coding m-ary sources, see, e.g., [34, 24, 21]. The use of binary decomposition of (non-binary) source symbols has a number of advantages over conventional m-ary coding. Binary arithmetic coding is much simpler for hardware and software implementations than m-ary arithmetic coding, especially if $m \gg 2$. On the other hand, one has to encode more than one binary event for each input symbol. However, using an appropriate decomposition tree and a binary coding technique, the method may result in faster and/or more efficient entropy coding. Another advantage of binarization is that it enables easy optimization of context models for conditional entropy coding, as will be described in Chapter 5. In this chapter, we formally introduce the technique and present some of its properties.
3.2 The binary decomposition technique

Consider a memoryless source with alphabet $A = \{a_0, a_1, \ldots, a_{m-1}\}$ and probability distribution of source symbols $P(A) = \{p(a), a \in A\}$. Let $B = \{0, 1\}$ denote the binary alphabet and let $A^n$, $B^n$ be the sets of all words of length $n$ in the alphabets $A$ and $B$, respectively. As usual, a source message of length $n$ is denoted $u_1^n = u_1 u_2 \ldots u_n$, $u_t \in A$.

Let $T_m$ be the set of proper and complete binary trees with $m$ terminating nodes (leaves) $\chi_0, \chi_1, \ldots, \chi_{m-1}$, and let $\Lambda = \{\eta_j, j = 0, 1, \ldots, m-2\}$ be the set of $m-1$ internal nodes of a tree $\tau \in T_m$. Let the tree $\tau$ be assigned to the source $A$ in such a way that to each symbol $a_k$, $k = 0, 1, \ldots, m-1$, of the source there corresponds a leaf $\chi_k$ of the tree $\tau$:

    $\tau : a_k \mapsto \chi_k$.    (3.1)

Then the source symbol $a_k$ is represented by a string of binary decisions, which is the path from the root to the leaf $\chi_k$.

The m-ary (memoryless) source generating a message $u_1^n$ of length $n$ in the alphabet $A$ can now be considered as a binary Markov source, modeled by the tree $\tau$, generating a binary sequence $b_1 b_2 b_3 \ldots$, $b \in B$. Each node $\eta \in \Lambda$ of the tree corresponds to a state of the Markov source, and the tree uniquely defines a corresponding directed graph of state transitions. The initial state is defined by the root node. At each state (node), the source generates the symbol 0 or 1 with the probability distribution $p(0 \mid \eta)$ and changes its state (possibly to the same state). The distributions $\{p(0 \mid \eta), \eta \in \Lambda\}$ are uniquely defined by the probability distribution $P(A)$ of the alphabet. Such a binary source can be encoded by conventional methods for coding Markov sources. Given the model and the initial state, the binary sequence is decomposed into $m-1$ subsequences $\{b_1^{n_\eta}(\eta), \eta \in \Lambda\}$, one generated at each state. The subsequences are in essence sequences of i.i.d. binary symbols drawn with probability $p(0 \mid \eta)$. They can be effectively encoded using some binary coding technique, e.g., arithmetic coding [61] or Golomb run-length coding [19]. (Note that for an arithmetic coder there is no need for an explicit decomposition of the binary sequence into the subsequences corresponding to the states; the arithmetic coder just uses the conditional probabilities $p(0 \mid \eta)$ to calculate a codeword for the whole binary sequence.)

The main parameter of a decomposition tree is the average number of binary coding operations per source symbol,

    $\bar n = \sum_{a \in A} p(a)\, n(a)$,    (3.2)

where $n(a)$ is the number of binary decisions required to code $a$ (the length of the binary path to the symbol $a$). This parameter determines the coding efficiency (in terms of both redundancy and speed, as will be discussed in the following sections). The minimum $\bar n$ is achieved when the tree is a Huffman tree for the source. The use of a
binary decomposition assumes that the decomposition tree is fixed¹ during encoding, and compression is performed by a binary coding technique.

¹ Otherwise, one would arrive at a (dynamic) Huffman coding technique and there would be no need for binary coding (at least for sources with an entropy larger than 1 bit).

3.3 On redundancy of binary decomposition

In this section, we derive an upper bound on the coding redundancy of the binarization technique. We define the redundancy $R$ as the difference between the average code length per source symbol and the entropy of the source:

    $R = \frac{1}{n} \sum_{u_1^n \in A^n} p(u_1^n)\, |\varphi(u_1^n)| - H$,    (3.3)

where $|\varphi(u_1^n)|$ denotes the codeword length for the message $u_1^n$ (see Chapter 2) and $H = -\sum_{a \in A} p(a) \log_2 p(a)$.

Let us allow the use of different binary coding techniques at each node. (For example, if some subsequence has a binary symbol probability close to 0.5, then, for simplicity of implementation, this subsequence may be put directly into the output bit stream.) If the probabilities of the source symbols are unknown, the subsequences can be coded using some universal or adaptive coding technique (e.g., binary arithmetic coding with the symbol probability estimate given by (2.29)).

Let $n_\eta$ be the length of the binary sequence generated at the state $\eta$. We define the average coding redundancy per binary symbol at the state $\eta$ as

    $\rho_\eta = \frac{1}{n_\eta} \sum_{b_1^{n_\eta} \in B^{n_\eta}} p(b_1^{n_\eta})\, |\varphi_b(b_1^{n_\eta})| - H_\eta$,    (3.4)

where $|\varphi_b(b_1^{n_\eta})|$ is the code length of a binary string $b_1^{n_\eta}$ generated with probability $p(b_1^{n_\eta})$, and $H_\eta$ is the entropy of the binary symbols generated at the state $\eta$:

    $H_\eta = -p(0 \mid \eta) \log_2 p(0 \mid \eta) - (1 - p(0 \mid \eta)) \log_2 (1 - p(0 \mid \eta))$.

The following theorem establishes the relationship between the decomposition tree $\tau$, the (redundancy of the) coding techniques used at the nodes, and the resulting redundancy $R$. The theorem also allows some interesting properties of the method to be stated, given as Corollaries 1-4.

Theorem 3. Let the source generate a message of length $n$ of symbols from an alphabet $A = \{a_0, a_1, \ldots, a_{m-1}\}$, and let this message be sequentially coded using the binary tree decomposition technique described above. Then the redundancy is

    $R = \bar n \sum_{\eta \in \Lambda} \pi_\eta \rho_\eta$,    (3.5)
where $\bar n$ and $\rho_\eta$ are defined by (3.2), (3.4), and $\pi_\eta = n_\eta / (\bar n n)$ can be viewed as the probability of the state $\eta$.

Remark: In this formula, $\bar n$ and $\pi_\eta$ are defined by the decomposition tree, whereas $\rho_\eta$ is determined by the binary coding techniques used at the nodes.

Proof. The proof is straightforward. We rewrite (3.5) as follows:

    $R = \bar n \sum_{\eta \in \Lambda} \pi_\eta \rho_\eta$
      $= \bar n \sum_{\eta \in \Lambda} \pi_\eta \left( \frac{1}{n_\eta} \sum_{b_1^{n_\eta} \in B^{n_\eta}} p(b_1^{n_\eta})\, |\varphi_b(b_1^{n_\eta})| - H_\eta \right)$
      $= \frac{1}{n} \sum_{\eta \in \Lambda} \sum_{b_1^{n_\eta} \in B^{n_\eta}} p(b_1^{n_\eta})\, |\varphi_b(b_1^{n_\eta})| - \bar n \sum_{\eta \in \Lambda} \pi_\eta H_\eta$.

The first term in the last equality is the average code length per source symbol and the second is the entropy of the input source.

Corollary 1. If $\rho_\eta = \rho$ for all $\eta \in \Lambda$, then $R = \bar n \rho$.

Corollary 2. If $r$-bit registers are used to represent the probabilities and a binary arithmetic coder with registers of size $g \ge r + 2$ is used to code the binary symbols at the nodes, then

    $R < 2 \bar n (r + \log_2 e)\, 2^{-(g-2)} + 2/n$.    (3.6)

Proof. It was proven in [46] that the per-symbol redundancy of an m-ary arithmetic coder is upper bounded by

    $\rho < m (r + \log_2 e)\, 2^{-(g-2)}$.    (3.7)

For binary coding $m = 2$, and two additional bits are required to terminate the message. This yields (3.6).

Corollary 3. If the decomposition tree is a Huffman tree, $r$-bit registers are used to represent the probabilities, and a binary arithmetic coder with registers of size $g \ge r + 2$ is used to code at the nodes, then

    $R < (H + 1)(r + \log_2 e)\, 2^{-(g-2)} + 2/n$.    (3.8)
Proof. The proof follows immediately from the fact that for a Huffman tree

    $\bar n < H + 1$.    (3.9)

This corollary is essential for coding low entropy sources ($H < 1$) with large alphabets ($m \gg 2$). In this case, the use of a Huffman tree results in a redundancy of order $O(H+1)$, independent of the alphabet size, whereas for m-ary arithmetic coding the redundancy is of order $O(m)$, see (3.7).

Finally, it is always possible to choose a tree such that $\bar n \le \lceil \log_2 m \rceil$.

Corollary 4. If the decomposition tree is such that $\bar n \le \lceil \log_2 m \rceil$, $r$-bit registers are used to represent the probabilities, and a binary arithmetic coder with registers of size $g \ge r + 2$ is used to code at the nodes, then

    $R < 2(\log_2 m + 1)(r + \log_2 e)\, 2^{-(g-2)} + 2/n$.

Thus, for sources with an alphabet of size $m > 4$, the upper bound on the coding redundancy using binary decomposition combined with binary arithmetic coding is lower than that of m-ary arithmetic coding (see (3.7)).

3.4 Binary decomposition of FSM sources

Alphabet binarization can be used for coding FSM sources as well. Let an FSM source be defined by the alphabet $A$, a set of states $S = \{s\}$, the initial state $s_1$ and a state transition rule. In this case, each state is assigned a decomposition tree $\tau_s \in T_m$ with the corresponding set of states (decision nodes) $\Lambda_s = \{\eta_s\}$. In general, the decomposition trees $\tau_s$ can be different (and may also be optimized) for each state. At each state the source generates a binary string of decisions corresponding to a source symbol. The resulting binary sequence can be modeled as a double FSM source whose set of states is the product $S' = \Lambda_s \times S$.

3.5 Binary decomposition and universal coding

Consider a memoryless m-ary source. If the source parameters $P(A) = \{p(a), a \in A\}$ are not known, then the subsequences of binary decisions can be coded using a universal coding approach. In this case, the decision probabilities can be estimated according to (2.28), where $s$ corresponds to a node $\eta$.

The use of the binary decomposition technique does not increase the stochastic complexity of the source model, since the number of parameters remains the same. An m-ary memoryless source and its binary counterpart both have $m-1$ free parameters (in the latter case this is the number of internal nodes of the decomposition tree). This holds for FSM sources as well, where the number of parameters is $(m-1)|S|$.
Thus, binarization of the source symbols at least does not make a compression algorithm worse in terms of the universal coding redundancy (2.25) or (2.26) (incurred due to the lack of knowledge of the parameters). Asymptotically this redundancy is the same. Yet, for finite-length messages, the redundancy may even be smaller, resulting in better compression performance. We show this in the following example.

Let us assume a memoryless m-ary source, and let the output message $u_1^n$ of length $n > 0$ be coded by an ideal m-ary arithmetic coder using the adaptive coding probabilities (2.29) with the parameter vector $\alpha_a = \frac{1}{2}$, $a \in A$. Then the code length is

    $|\varphi(u_1^n)| = -\log_2 \frac{\Gamma\!\left(\frac{m}{2}\right) \prod_{a \in A} \Gamma\!\left(\vartheta(a \mid u_1^n) + \frac{1}{2}\right)}{\pi^{m/2}\, \Gamma\!\left(n + \frac{m}{2}\right)} = n \hat H + \frac{m-1}{2} \log_2 n + O_1(1)$,    (3.10)

where $\Gamma(\cdot)$ is the Gamma function, $\vartheta(a \mid u_1^n)$ denotes the symbol counts, and

    $\hat H = -\sum_{a \in A} \frac{\vartheta(a \mid u_1^n)}{n} \log_2 \frac{\vartheta(a \mid u_1^n)}{n}$    (3.11)

is the empirical entropy of the message.

Let the same message be coded using the binary decomposition technique combined with an ideal binary arithmetic coder. Then the code length is the sum of the code lengths of the binary sub-strings corresponding to the decomposition tree nodes:

    $|\varphi(u_1^n)| = \sum_{\eta \in \Lambda} |\varphi(b_1^{n_\eta}(\eta))| = \sum_{\eta \in \Lambda} \left( n_\eta \hat H_\eta + \frac{1}{2} \log_2 n_\eta + O_\eta(1) \right)$.    (3.12)

Thus

    $|\varphi(u_1^n)| = \sum_{\eta \in \Lambda} n_\eta \hat H_\eta + \frac{1}{2} \sum_{\eta \in \Lambda} \log_2 n_\eta + O_2(1)$,    (3.13)

where $n_\eta$ denotes the lengths of the binary sub-strings,

    $\hat H_\eta = -\frac{\vartheta(0 \mid b_1^{n_\eta})}{n_\eta} \log_2 \frac{\vartheta(0 \mid b_1^{n_\eta})}{n_\eta} - \frac{\vartheta(1 \mid b_1^{n_\eta})}{n_\eta} \log_2 \frac{\vartheta(1 \mid b_1^{n_\eta})}{n_\eta}$    (3.14)

is the empirical entropy of a binary sub-string, and

    $O_2(1) = \sum_{\eta \in \Lambda} O_\eta(1)$.    (3.15)

It can be shown that

    $n \hat H = \sum_{\eta \in \Lambda} n_\eta \hat H_\eta$.    (3.16)
The second term in (3.13) attains its maximum when

    $n_\eta = \tilde n$,  $\eta \in \Lambda$,    (3.17)

i.e., when all $n_\eta$ are equal to each other. The common value $\tilde n$ can be estimated as

    $\tilde n = \frac{\bar n n}{m-1} < n$,    (3.18)

where $\bar n$ is the average number of binary decisions per symbol. Thus, we have

    $\frac{1}{2} \sum_{\eta \in \Lambda} \log_2 n_\eta \le \frac{m-1}{2} \log_2 \tilde n = \frac{m-1}{2} \log_2 \frac{\bar n n}{m-1}$,    (3.19)

and thereby

    $|\varphi(u_1^n)| \le n \hat H + \frac{m-1}{2} \log_2 n + O_3(1)$,    (3.20)

where

    $O_3(1) = O_2(1) - \frac{m-1}{2} \log_2 \frac{m-1}{\bar n}$.    (3.21)

If $m$ is large enough and $\bar n$ is small, then $O_3(1)$ may be less than $O_1(1)$. Moreover, the inequality (3.19) overestimates the second term in (3.13) because of the assumptions (3.17) and (3.18). Thus, in many practical cases, especially if the probability distribution is non-uniform (skewed), the use of binary decomposition combined with binary arithmetic coding may result in a shorter code length than m-ary arithmetic coding.
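To make the adaptive estimate used above concrete for the binary decisions coded at a single decomposition-tree node, here is a minimal sketch (ours, illustrative only) of a per-node counter with the choice $\alpha = \frac{1}{2}$ of (2.29), i.e., the Krichevsky-Trofimov estimator:

    #include <math.h>

    /* Adaptive probability estimate (2.29) for the binary decisions coded at one
     * decomposition-tree node, with alpha = 1/2 (Krichevsky-Trofimov estimator). */
    typedef struct {
        unsigned count[2];                        /* occurrences of 0 and 1 so far */
    } NodeStats;

    /* Probability assigned to the next decision being 'bit' at this node. */
    static double node_prob(const NodeStats *s, int bit)
    {
        double t = (double)(s->count[0] + s->count[1]);
        return (s->count[bit] + 0.5) / (t + 1.0);
    }

    /* Ideal code length contribution of coding 'bit', then update the counts. */
    static double code_and_update(NodeStats *s, int bit)
    {
        double len = -log2(node_prob(s, bit));    /* cf. the bound (2.16) */
        s->count[bit]++;
        return len;
    }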
Chapter 4

Applications of binary decomposition to source coding

4.1 Introduction

In this chapter, we use the binarization technique to develop methods for efficient coding of parametric sources, in particular sources with the generalized two-sided geometric distribution (GTSGD). The GTSGD is an extension of the two-sided geometric distribution (TSGD) introduced in [28] and often occurs in image compression algorithms. The distribution is introduced in Section 4.2. The advantage of the GTSGD model is its small number of parameters, which makes it possible to avoid the context dilution problem in multi-context algorithms (see Chapter 5). Direct implementation of conventional arithmetic coding requires as many coding parameters as the alphabet size, which is usually much larger than the number of distribution parameters. We will show that binary decomposition allows for an efficient solution to this problem. Two approaches are described in Section 4.3. Some experimental results are presented in Section 4.4. Finally, in Section 4.5 we use binary decomposition for a redundancy analysis of the widely used Rice coding algorithm [38].

4.2 Generalized two-sided geometric distribution

The generalized two-sided geometric distribution (GTSGD) is deduced as a dead-zone quantization of the off-centered continuous Laplacian distribution

    $f(x) = \frac{\beta}{2} \exp(-\beta |x - \varepsilon|)$,

which has been shown to be a good approximation of the distribution of prediction errors and transform coefficients [31, 37]. The distribution is specified by the decay parameter $\beta$ and the offset $\varepsilon$. The dead-zone quantizer is non-uniform and specified by the zero and non-zero quantization intervals $Q_0$ and $Q$, respectively. The quantizer and the distribution are depicted in Figure 4.1. We assume that $\varepsilon < Q_0/2$, i.e., the distribution center falls into the zero quantization bin. This restriction is justified by the fact that in practice
it normally holds for transform coefficients, and for context-based prediction schemes the unit interval containing the center of the distribution can be located by an error feedback loop [60, 68].

[Figure 4.1: Example of the off-centered Laplacian distribution and the dead-zone quantizer.]

The probability distribution of the output symbols $i \in \mathbb{Z}$ of the quantizer is defined as

    $p(i = 0) = \int_{-Q_0/2}^{Q_0/2} f(x)\,dx$,

    $p(i > 0) = \int_{Q_0/2 + (i-1)Q}^{Q_0/2 + iQ} f(x)\,dx$,

    $p(i < 0) = \int_{-Q_0/2 + iQ}^{-Q_0/2 + (i+1)Q} f(x)\,dx$,
and after a little algebra we find

    $p(i) = \begin{cases} 1 - \frac{\theta^{\lambda/2}}{2}\left(\theta^{\gamma} + \theta^{-\gamma}\right), & i = 0, \\[4pt] \frac{1}{2}\,\theta^{\lambda/2 - \gamma - 1}(1-\theta)\,\theta^{i}, & i > 0, \\[4pt] \frac{1}{2}\,\theta^{\lambda/2 + \gamma - 1}(1-\theta)\,\theta^{|i|}, & i < 0, \end{cases}$    (4.1)

where $\theta = \exp(-\beta Q)$, $\gamma = \varepsilon/Q$ and $\lambda = Q_0/Q$ are the new distribution parameters. By introducing these parameters, we decouple the model from the Laplacian distribution and the quantizer parameters. The parameter $\lambda$ controls the probability of the zero symbol, whereas $\gamma$ and $\theta$ define the offset and the rate of decay of the distribution, respectively. We shall call the distribution (4.1) the generalized two-sided geometric distribution (GTSGD). It comprises the two-sided geometric distribution (TSGD) proposed in [28] for modeling prediction residuals in lossless image compression algorithms:

    $p(i) = \frac{(1-\theta)\,\theta^{|i+\gamma|}}{\theta^{1-\gamma} + \theta^{\gamma}}$,    (4.2)

where $\theta$ and $\gamma$ have the same meaning as in (4.1). The TSGD can be derived from (4.1) by setting

    $\lambda = \frac{Q_0}{Q} = 2\left(1 + \frac{\ln 2 - \ln(\theta^{1-\gamma} + \theta^{\gamma})}{\ln \theta}\right)$.    (4.3)

4.3 Efficient coding of sources with the GTSGD

Direct use of m-ary arithmetic coding may be inefficient for coding sources with the GTSGD. The main problem is the alphabet size. Although (4.1) assumes an infinite alphabet, in practice the alphabet is finite but quite large (e.g., the potential range of DCT coefficients in the JPEG image compression standard is large even for images with 8 bits per sample [33]). Using arithmetic coding, one must store and update as many parameters as the alphabet size.¹ Furthermore, such a skewed distribution may waste code space and thereby reduce the coding efficiency.

¹ Of course, for the source (4.1) one could estimate, store and update only three parameters and calculate the probabilities for the whole alphabet, but this would decrease the coding speed.

Binary decomposition makes it possible to overcome these problems and to efficiently code sources with a large alphabet and a skewed distribution. Using multiplication-free arithmetic coding, one may also design fast algorithms. An efficient method using binary decomposition of the source alphabet combined with a binary arithmetic coder was proposed for coding the DC and AC coefficients of the DCT in the JPEG image compression standard [33].

In this section, we describe two binarization methods for coding sources with the GTSGD. The methods are optimal in the sense that the number of coding parameters
(3) is the same as the number of parameters of the distribution (4.1). This is 11 times fewer than in the JPEG decomposition (which has 33 parameters). A small number of coding parameters is essential for multi-context algorithms. It may also be beneficial for compression systems with limited memory or hardware resources. Another advantage of the methods is that they are developed for sources with a potentially infinite alphabet, meaning that no limit is imposed on the alphabet size: they are suitable for real sources with any alphabet size without any changes to the algorithm.

The first method is based on a decomposition tree denoted A, which is a unary representation of the index of a symbol in the sequence of symbols of the source (4.1) arranged in the order 0, +1, -1, +2, -2, ... if $\gamma \ge 0$, or 0, -1, +1, -2, +2, ... if $\gamma \le 0$ (i.e., in order of non-increasing probability). In the tree A, the path from the root to the leaf $i$ (the source symbol) is

    $\underbrace{1\ldots1}_{2i-1}\,0$ if $i > 0$,  $\underbrace{1\ldots1}_{2|i|}\,0$ if $i \le 0$,

if $\gamma \ge 0$, or

    $\underbrace{1\ldots1}_{2i}\,0$ if $i \ge 0$,  $\underbrace{1\ldots1}_{2|i|-1}\,0$ if $i < 0$,

if $\gamma \le 0$. Without loss of generality, in the following we assume that $\gamma \ge 0$.

The decomposition tree is depicted in Figure 4.2, where $\eta_j$, $j = 0, 1, 2, \ldots$, denote the (decision) nodes; the root node corresponds to $\eta_0$. To encode a source symbol $i$, the encoder codes a sequence of $k = 2i - 1$ (if $i > 0$) or $k = 2|i|$ (if $i \le 0$) binary symbols '1' followed by a '0'. The $j$-th symbol of this binary sequence ($0 \le j \le k$) corresponds to a decision at the node $\eta_j$ and is coded using a single parameter $p(0 \mid \eta_j)$ (the probability of the decision being '0'). It can readily be verified that

    $p(0 \mid \eta_j) = \begin{cases} 1 - \frac{\theta^{\lambda/2}}{2}\left(\theta^{\gamma} + \theta^{-\gamma}\right), & j = 0, \\[4pt] \frac{(1-\theta)\,\theta^{-\gamma}}{\theta^{-\gamma} + \theta^{\gamma}}, & j = 2k-1, \\[4pt] \frac{(1-\theta)\,\theta^{\gamma}}{\theta^{\gamma} + \theta^{1-\gamma}}, & j = 2k, \end{cases}$    (4.4)

where $k = 1, 2, \ldots$.

It is clear that even though the tree has infinitely many nodes, only three parameters, corresponding to the nodes $\eta_0$, $\eta_1$ and $\eta_2$, need to be stored and updated. The last two are used for coding at all remaining odd and even nodes, respectively. Thus, the
decomposition A is optimal in terms of the number of coding parameters for the source (4.1).

[Figure 4.2: The decomposition tree A.]

The following pseudo C-code implements encoding using the decomposition A:

    if (i > 0) j := 2*i - 1;     /* mapping the source symbol i */
    else       j := -2*i;        /* into its index j            */
    if (j == 0)
        encode(0, η0);
    else {
        encode(1, η0);
        j := j - 1;
        η := η1;
        while (j > 0) {
            encode(1, η);
            if (η == η1) η := η2; else η := η1;
            j := j - 1;
        }
        encode(0, η);
    }

where encode(BinValue, η) is a binary arithmetic coding procedure that encodes BinValue using the probability distribution defined by the state (node) η.

The average number of binary coding operations per source symbol using the decomposition tree A is

    $\bar n_A = 1 + \frac{\theta^{\lambda/2}\left(2\theta^{\gamma} + \theta^{-\gamma}(1+\theta)\right)}{2(1-\theta)}$.    (4.5)

The decomposition tree A has an interesting property if $\lambda \ge 1$ and $\theta \le \frac{1}{3}$: it is optimal in terms of both the number of parameters and the average number of binary
36 30 CHAPTER 4. APPLICATIONS OF BINARY DECOMPOSITION coding operations per source symbol. The first optimality is obvious from (4.4) and the latter is proved in the following theorem. Theorem 4. If λ 1 and θ 1, then the decomposition A is a Huffman tree for the 3 source (4.1). (Note, that we assumed γ 0 and ε Q 0 /2 meaning that γ λ.) 2 Proof. For convenience, we shall refer to the source symbol i Z by the index j = {0, 1,... } of the corresponding node η j, see Figure 4.2. According to Theorem 1 from [17], a binary tree is a Huffman tree iff it has the sibling property. Each node η j, j > 0 of the tree A has its sibling which is the terminating node (leaf) j 1 (see Figure 4.2). The probability of the node is p(η j ) = The tree A has the sibling property, if p(k). (4.6) k=j p(η j ) p(j), p(η j ) p(η j+1 ), p(j 1) p(j), p(j 1) p(η j+1 ), (4.7) for every j = {1, 2,... }. The first two inequalities are obvious from the fact that p(η j ) = p(η j+1 ) + p(j), see (4.6). The third inequality follows from the distribution (4.1). It is easy to show that the forth inequality is satisfied for any γ [ ] [ 0, λ 2 and θ 0, 1 3]. The theorem essentially states that no one decomposition can achieve a smaller n for the distribution (4.1) in the stated range of the distribution parameters. If λ < 1, then the deduction of the parameters range, which guarantees the decomposition A to be a Huffman tree, includes non-trivial algebra and it is not interesting from a practical point of view. For λ 1 and θ 1 3, n A 2.44 and the entropy of the source H Hereafter, a source with the entropy H 2.36 bits will be referred to as a low entropy source, and a source with H > 2.36 bits as a high entropy source. The drawback of the decomposition A is that for the high entropy sources the tree is not a Huffman tree. A good trade off is a tree, which provides a reasonable number of binary coding operations per input symbol for a wide range of θ, while having a small number of parameters and a simple structure. For the high entropy sources, e.g., for coding of prediction residuals in lossless image compression algorithms, we propose the decomposition tree denoted B. The first decision in this decomposition is whether the symbol i is zero or not (let it be 0 if i = 0 and 1 otherwise). If not, then the second decision is whether the symbol is positive or negative (let it be 1 if i > 0 and 0 otherwise), and then use unary decomposition of i 1 (let it be i 1 1 s followed by 0 ). Figure 4.3 shows the tree.
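The condition of Theorem 4 is easy to check numerically. The following C sketch does so for the symmetric case γ = 0, in which, by normalization, (4.1) reduces to p(0) = 1 − θ^{λ/2} and p(±i) = (1/2) θ^{λ/2−1}(1 − θ) θ^{i}, i ≥ 1; the function names, the truncation depth JMAX and the example parameter values are ours and purely illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Probability of the symbol with index j in the reordered alphabet           */
    /* (0, +1, -1, +2, -2, ...) for the symmetric case gamma = 0 of (4.1).        */
    static double p_index(int j, double theta, double lambda)
    {
        if (j == 0)
            return 1.0 - pow(theta, lambda / 2.0);
        int i = (j + 1) / 2;   /* indices 2i-1 and 2i both have magnitude i       */
        return 0.5 * pow(theta, lambda / 2.0 - 1.0) * (1.0 - theta) * pow(theta, i);
    }

    /* Check the sibling property (4.7) of the tree A up to depth JMAX.           */
    static int has_sibling_property(double theta, double lambda, int JMAX)
    {
        for (int j = 1; j < JMAX; j++) {
            double p_eta_j = 0.0, p_eta_j1 = 0.0;   /* tail sums, see (4.6)       */
            for (int k = j; k <= JMAX; k++)     p_eta_j  += p_index(k, theta, lambda);
            for (int k = j + 1; k <= JMAX; k++) p_eta_j1 += p_index(k, theta, lambda);

            if (p_eta_j < p_index(j, theta, lambda))                        return 0;
            if (p_eta_j < p_eta_j1)                                         return 0;
            if (p_index(j - 1, theta, lambda) < p_index(j, theta, lambda))  return 0;
            if (p_index(j - 1, theta, lambda) < p_eta_j1)                   return 0;
        }
        return 1;
    }

    int main(void)
    {
        printf("%d\n", has_sibling_property(0.30, 1.0, 1000));  /* theta <= 1/3: prints 1 */
        printf("%d\n", has_sibling_property(0.60, 1.0, 1000));  /* larger theta: prints 0 */
        return 0;
    }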
37 4.3. EFFICIENT CODING OF SOURCES WITH THE GTSGD η 0 η 1 i = 0 η 2 η 3 η 4 i = 1 η 5 i = 1 i = 2 i = Figure 4.3: The decomposition tree B. The probability of the decision 0 at the node η i is defined as 1 2 θ ( λ 2 θ γ + θ γ), j = 0, p(0 η j ) = θ γ, j = 1, θ γ + θ γ 1 θ, j = 2, 3,... (4.8) Thus, for this decomposition tree, as well as for the decomposition A, the parameters at only three nodes η 0, η 1 and η 2 are sufficient to store and update. However, we propose to keep the statistics for positive and negative branches separately, i.e., use a separate parameter p(0 η 3 ) for the node η 3. This allows for capturing the statistics if the actual probability distribution of source symbols has different slopes for positive and negative values. Thus, we propose the total number of parameters for this decomposition to be 4. Decisions at the nodes η 2j and η 2j+1, j = 1, 2,..., are coded using the parameters p(0 η 2 ) and (0 η 3 ) respectively. Encoding using the decomposition B is implemented by the following pseudo C- code: if (i == 0) encode(0, η 0 ); else { encode(1, η 0 ); if (i > 0) { encode(1, η 1 ); i := i 1; while (i > 0) { encode(1, η 2 ); i := i 1; }
38 32 CHAPTER 4. APPLICATIONS OF BINARY DECOMPOSITION } } else { } encode(0, η 2 ); encode(0, η 1 ); i := i 1; while (i > 0) { encode(1, η 3 ); i := i 1; } encode(0, η 3 ); The average number of binary coding operations per input symbol using this decomposition is n B = 1 + θ λ 2 2 ( ) 2 θ (θ γ + θ γ). 1 θ For the low entropy sources n B / n A < 1.34 and for the high entropy sources 0.5 < n B / n A < 1 (for H > 7.5 bit, 0.5 < n B / n A < and if H, then n B / n A 0.5, γ [0, λ 2 ]). 4.4 Experimental results To evaluate efficiency of the proposed methods, we implemented them for entropy coding of prediction residuals and DCT coefficients for lossless and lossy compression techniques, respectively. The aim was to evaluate the compression and speed efficiency for coding of data with different entropy and for different applications and compare them with that of the decomposition used in the JPEG standard. Binary coding for both decompositions, as well as for the JPEG decomposition, was implemented using the QM-coder [33]. For testing purposes we used nine 8-bit grayscale images of size , which are available via the internet 2. In the first case, we implemented a simple prediction technique and coded the prediction residuals. We used the average of the pixel above v[y 1, x] and the pixel to the left, v[y, x 1], as the prediction value for the current pixel v[y, x]. This corresponds to the predictor No.7 of the lossless mode of the JPEG standard. The sequence of prediction residuals in the raster scanning coding order was treated as a memoryless source. Table 4.1 shows the resulting number of bytes of the compressed images. Table 4.2 gives the number of binary coding operations per pixel. We used this parameter to evaluate the speed performance. The Diff. column in all tables gives the relative differences of figures between the JPEG decomposition and the proposed ones. 2 ftp://ftp.csd.uwo.ca/pub/from wu/images
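In self-contained form, the encoding procedures for the two decompositions used in these experiments read as follows. This is only a sketch: encode(bit, node) stands for the binary arithmetic coding primitive of Section 4.3 (realized with the QM-coder in the experiments), and the routine names are ours.

    /* Binary arithmetic coding primitive: encodes one decision using the        */
    /* statistics of the given node (coding state).                              */
    void encode(int bit, int node);

    /* Decomposition A (Figure 4.2): nodes 0, 1, 2 correspond to eta_0, eta_1,   */
    /* eta_2; all remaining odd/even nodes reuse the statistics of eta_1/eta_2.  */
    void encode_A(int i)
    {
        int j = (i > 0) ? 2 * i - 1 : -2 * i;   /* map the symbol to its index   */
        if (j == 0) { encode(0, 0); return; }
        encode(1, 0);
        j = j - 1;
        int node = 1;
        while (j > 0) {
            encode(1, node);
            node = (node == 1) ? 2 : 1;         /* alternate eta_1 and eta_2     */
            j = j - 1;
        }
        encode(0, node);
    }

    /* Decomposition B (Figure 4.3): zero flag at eta_0, sign at eta_1, then a   */
    /* unary code of |i| - 1 coded at eta_2 (positive) or eta_3 (negative).      */
    void encode_B(int i)
    {
        if (i == 0) { encode(0, 0); return; }
        encode(1, 0);
        encode(i > 0 ? 1 : 0, 1);               /* sign decision at eta_1        */
        int run  = (i > 0) ? i - 1 : -i - 1;
        int node = (i > 0) ? 2 : 3;
        while (run > 0) {
            encode(1, node);
            run = run - 1;
        }
        encode(0, node);
    }

The decoders simply mirror these loops, reading decisions until the terminating 0 is met.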
39 4.4. EXPERIMENTAL RESULTS 33 Image JPEG decomp. Decomp. A Diff. Decomp. B Diff. baloon % % barb % % barb % % board % % boats % % girl % % gold % % hotel % % zelda % % average % % Table 4.1: The number of bytes for lossless image compression. In the second case, we coded DCT coefficients, acquired from the JPEG compressed files with the average compression ratio about 7:1 3. The coefficients of the same spatial frequency were coded together as a standalone message of a memoryless source with the GTSGD. We did not use the end of block symbol. The reason was to compare the average compression performance for sources with different entropy, which, in essence, were formed by the coefficients with different spatial frequencies. Tables 4.3 and 4.4 present the number of bytes of the compressed images and the number of binary coding operations per pixel, respectively. Experimental results show that the compression performance of both proposed decompositions is not worse than that of the JPEG standard, while they have a much smaller number of the parameters. The decompositions A and B require to store and update only 3 and 4 parameters, respectively, versus 33 parameters for the JPEG decomposition. (Therefore, traversing the tree is also simpler.) The decomposition A has an obvious advantage for sources with entropy H 2.36 bits, since it has the smallest number of binary coding operations per input symbol. The decomposition B has about half the number of coding operations per input symbol than the decomposition A for high entropy sources. Although, the decomposition B has about 20% more binary coding operations per symbol compared with the JPEG decomposition, the real difference of speed performance may be less than this figure. This is because in the coding process passing the JPEG tree decomposition is a more costly operation than passing the tree B (there are fewer nodes to pass). Nonetheless, both decompositions allow for improvement of speed performance in different ways. A straightforward way is to use one of a number of proposed multiplication free binary arithmetic coders (see, e.g., [21, 24, 34]). Another way is to use run-length Golomb codes [19]. For example, the Rice coding algorithm [38] can be viewed as a fast version of the decomposition A, where the decisions are treated as a binary memoryless source and coded using Golomb codes. (The technique will be described 3 The reason for such a ratio was to have the GTSGD sources in a wide range of distribution parameters (see below in the paragraph). Otherwise, for higher compression ratio most of the high frequency coefficients would be zero.
40 34 CHAPTER 4. APPLICATIONS OF BINARY DECOMPOSITION Image JPEG decomp. Decomp. A Diff. Decomp. B Diff baloon % % barb % % barb % % board % % boats % % girl % % gold % % hotel % % zelda % % average % % Table 4.2. The average number of binary coding operations per pixel for lossless image compression. Image JPEG decomp. Decomp. A Diff. Decomp. B Diff baloon % % barb % % barb % % board % % boats % % girl % % gold % % hotel % % zelda % % average % % Table 4.3: The number of bytes of compressed DCT coefficients. Image JPEG decomp. Decomp. A Diff. Decomp. B Diff baloon % % barb % % barb % % board % % boats % % girl % % gold % % hotel % % zelda % % average % % Table 4.4. The average number of binary coding operations per pixel for the DCT coefficients.
41 4.5. ON REDUNDANCY OF RICE CODING 35 in details in the next section, where we will analyze its redundancy properties.) The speed improvement of the fast version is at the price of higher redundancy. The relative redundancy of the Rice algorithm, as it will be shown, is upper bounded by 50%, if H 0, and for 0.05 < H < 1 it is about 10%-30%. The speed performance of the decomposition B, combined with the QM-coder (like in our experiment), can be improved by using speed up mode [33], where a series of binary decisions is coded at once. Another way is to send the second decision (the sign bit) directly to the output bit stream. This will increase the code length in case of the distribution asymmetry (γ > 0). However, in most applications γ 0 (especially for low entropy sources). That is, in practice, the increase usually is negligibly small. One more advantage of the proposed methods is that there is no embedded limit on the alphabet size, thereby any value can be encoded and there is no need to adjust the algorithm while designing, for example, a compression system for images with different number of bits per pixel. 4.5 On redundancy of rice coding Rice coding [38] (or the Rice Algorithm) is a widely used technique in image compression for entropy coding due to its efficiency and simple implementation. It is recommended as the base of a standard for space image compression applications [39] and in a modified version it is used in the recent standard for lossless image compression JPEG-LS [60] as a part of the entropy coder. (By Rice coding we assume a technique that consists of Rice preprocessing [39] followed by run-length coding using Golomb [19] or Rice codes, also called Golomb-power-of-2 [60] (GP2) codes. We describe the technique in details below.) The technique can be viewed as a special case of source coding using the decomposition tree A, as we mentioned in the previous section. In this section we investigate the efficiency of this algorithm in terms of relative per-symbol coding redundancy, i.e., the redundancy caused by the method of coding given the source parameters, as a function of these parameters. This will allow us to show the potential performance of Rice coding technique for different practical situations. Rice coding is applied to a source modeled by integers with a probability mass function p(i), i Z, which satisfies the property: p(0) p(+1) p( 1) p(+2).... The algorithm consists of two steps. In the preprocessing step, the source symbol i is first mapped into its index j = {0, 1, 2,... } in the sequence of symbols arranged in order 0, +1, 1, +2, 2,.... Each index is then unarily coded and the sequence of index codewords is concatenated to form a sequence of binary symbols called the fundamental sequence (FS) (we assume that unary representation of the symbol j corresponds to the sequence of j ones followed by zero). The output of the preprocessing step (the fundamental sequence) is entropy coded using run-length coding based on Golomb or GP2 codes.
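The two building blocks are easy to state in code. The sketch below writes the fundamental-sequence (unary) codeword of a source symbol and the GP2 codeword of a non-negative integer; put_bit is a hypothetical bit-output routine, and the usual GP2 convention (the quotient n >> k in unary followed by the k least significant bits of n) is assumed, although implementations differ in the polarity of the unary part.

    void put_bit(int bit);   /* hypothetical bit-output routine */

    /* Rice preprocessing: map a symbol i to its index j in the order            */
    /* 0, +1, -1, +2, -2, ... and emit j ones followed by a zero, i.e. the       */
    /* contribution of this symbol to the fundamental sequence.                  */
    void put_fundamental(int i)
    {
        int j = (i > 0) ? 2 * i - 1 : -2 * i;
        while (j-- > 0)
            put_bit(1);
        put_bit(0);
    }

    /* GP2 (Golomb-power-of-2) codeword of n >= 0 with parameter k.              */
    void put_gp2(unsigned n, int k)
    {
        unsigned q = n >> k;
        while (q-- > 0)
            put_bit(1);
        put_bit(0);
        for (int b = k - 1; b >= 0; b--)
            put_bit((n >> b) & 1);
    }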
42 36 CHAPTER 4. APPLICATIONS OF BINARY DECOMPOSITION If the most probable symbol in the FS is 1, e.g., when coding prediction residuals, then the algorithm performs symbolwise coding. Otherwise, it performs block coding, thus allowing for entropy coding of sources like the DCT or wavelet transform coefficients after quantization. The key property of the method is that there is no need to store any code tables. Given the source symbol or the block, its codeword can merely be calculated. The calculation is simpler for GP2 codes (although, for some loss in compression performance). It is easy to see that Rice preprocessing can be viewed as using the decomposition tree A of source symbols (see Fig. 4.2). From this point of view the FS is in essence a sequence of decisions, which is run-length coded using Golomb or GP2 codes, treating it as a binary memoryless source. In this case, a single parameter, which we shall denote ˆp, characterizes the binary sequence. It corresponds to the zero-order probability of a decision to be 0, which can be found as ˆp = 1 n, where n is defined by (3.1). Assuming that the sequence of decisions is a memoryless binary source, its entropy is h = ˆp log ˆp (1 ˆp) log(1 ˆp), and hence we can introduce the quasi entropy of the input source as Ĥ = h n = nlog( n) ( n 1) log( n 1). (4.9) We estimate the redundancy by the ratio ϱ 0 = Ĥ H H, (4.10) where H = i Z p(i) log p(i) is the real entropy of the source. In Rice coding, the decisions are treated as a memoryless binary source, i.e., assuming that all the nodes of the decomposition tree have the same probability of a decision to be 0. This is true only if the index in the sequence of rearranged source symbols in non-increasing probability order has one-sided geometric probability distribution 4. In all other cases such a coding will result in some redundancy, depending on the distribution of the source 5. Here we assume that Rice coding is mainly used to code source symbols such as prediction residuals in lossless image compression or transform coefficients after quantization in lossy image compression. This kind of source can be approximated by the GTSGD (4.1). 4 p(j) = (1 Θ)Θ j, 0 < Θ < 1, j = 0, 1, 2,..., see [19]. 5 This redundancy can be thought of as a measure of closeness to the one-sided geometric distribution.
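The quantities n̄, p̂, Ĥ and ϱ_0 defined above are easy to evaluate numerically for any source given in non-increasing index order; a C sketch (the function name and the truncation of the infinite alphabet to JMAX symbols are ours, and n̄ > 1 is assumed):

    #include <math.h>

    /* Ideal relative redundancy rho_0 of Rice coding, see (4.9)-(4.10).          */
    /* p[j], j = 0 .. JMAX-1, are the probabilities of the reordered symbols      */
    /* in non-increasing order (a truncation of the infinite alphabet).           */
    double rice_redundancy(const double *p, int JMAX)
    {
        double nbar = 0.0, H = 0.0;
        for (int j = 0; j < JMAX; j++) {
            nbar += (j + 1) * p[j];          /* binary decisions spent on index j */
            if (p[j] > 0.0)
                H -= p[j] * log2(p[j]);      /* true entropy of the source        */
        }
        /* quasi-entropy (4.9), i.e. nbar times the binary entropy of 1/nbar      */
        double Hq = nbar * log2(nbar) - (nbar - 1.0) * log2(nbar - 1.0);
        return (Hq - H) / H;                 /* relative redundancy (4.10)        */
    }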
43 4.5. ON REDUNDANCY OF RICE CODING 37 Given the distribution (4.1), n is defined by (4.5) and the entropy of the source is given by ( H = 1 1 ) ( 2 θ λ 2 (θ γ + θ γ ) log 1 1 ) 2 θ λ 2 (θ γ + θ γ ) 1 [(( ) 2 θ λ λ log 2 θ + log 2 (1 θ) + log ) 2 θ (θ 1 θ 1 γ + θ γ) + γ log 2 θ (θ γ θ )] γ. (4.11) Using (4.5), (4.9) and (4.11) the redundancy ϱ 0 can be easily calculated by (4.10). We shall consider the behaviour of ϱ 0 for λ 1 defined by (4.3) (i.e., for the TSGD model) and λ 2 = 1 (for uniform quantization). Given θ, ϱ 0 has its maximum when γ = 0 for both cases. Thus, setting γ = 0 we get the upper bound on ϱ 0. It can be readily shown, that for λ 1 and λ 2 : and lim ϱ 0(θ) = 0 θ 1 lim θ 0 ϱ 0(θ) = 1 2. Figure 4.4 shows the relative redundancy as a function of the entropy for sources defined by λ 1 and λ 2. One can see, that in both cases the redundancy is approximately the same, being less than 10% for sources with the entropy H 1 bit. It tends to zero if H. The redundancy is also upper bounded by 50% if H 0. Note, that ϱ 0 is the ideal relative redundancy, that is caused only by the assumption of the sequence of decisions being a memoryless source. It allows to see potential efficiency of the algorithm. In order to estimate the actual redundancy, we have to add the redundancy, caused by the method of coding. The overall relative redundancy is defined by ϱ = Ĥ + nρ b 1, (4.12) H where ρ b is the absolute per (binary) symbol redundancy, caused by the method of binary coding. If GP2 codes are used to code the sequence of the decisions, then ρ b = l ˆpl m(1 ˆp l m) 1 1 ˆp m (4.13) where ˆp m = max{ˆp, 1 ˆp} is the probability of the most likely symbol (decision) and ( ) log l = 1 log 2 ˆp m 2 log 2 ( (4.14) 5 1) 1 is the parameter 6. Figure 4.5 shows the resulting relative redundancy as a function of the entropy for GP2 codes and the source defined by λ 1 (we assumed γ = 0, thereby, this figure shows 6 The derivation of this formula is based on Lemma 4 from [28].
the upper bound). The relative redundancy is less than 10% if the entropy H ≥ 1 bit. In practice, for the low entropy sources, one may expect a redundancy of about %. In some practical implementations, this may be a reasonable price for the possibility of very fast coding.

Figure 4.4. Relative per-symbol redundancy of Rice coding as a function of the source entropy H for λ_1 and λ_2.

4.6 Summary

In this chapter, we investigated source coding based on a binary decomposition of source symbols. We showed that the method usually results in more efficient entropy coding algorithms than conventional m-ary arithmetic coding. It allows for having as small a number of coding parameters as the number of source distribution parameters. This property is essential in multi-context coding algorithms, where a large number of coding parameters may reduce compression efficiency due to the high model cost. Binarization also helps to overcome the problem of wasting the code space for sources with large alphabets and/or skewed distributions.
Figure 4.5. Relative redundancy ϱ of the algorithm as a function of the source entropy H when GP2 codes are used to code the sequence of binary decisions and the model is defined by λ_1. The dotted line shows the ideal redundancy from Figure 4.4 for comparison purposes.
47 Chapter 5 Context modeling for image compression The search for a good source model is a central task in developing any compression algorithm. The better the model approximates the statistical properties of a source, the higher compression can be achieved. In this chapter, we describe a unified approach to the design of optimized context models based on training data, primarily intended for image compression algorithms. The two key techniques, context parameter initialization and context quantization, are presented and discussed. 5.1 Introduction In Chapter 2 we described the general structure of an image coding system, where the necessary part is the entropy coder, which essentially performs the compression. The entropy coder treats the (transformed and maybe quantized) image data as an output of some information source. It sequentially assigns the coding probability to the message according to the source model (see Figure 2.1), which is assumed to generate the data. The arithmetic coder performs calculation of the codeword, the length of which is within 2 bits of the possible minimum (disregarding the calculation precision problem): ϕ(u n 1 ) < log 2 q(u n 1 ) + 2, (5.1) where q(u n 1 ) is the coding probability of the message un 1. The key role of the source model is to provide such a sequence of coding probabilities q(u t+1 u t 1), t = 1, 2,...n that n 1 q(u n 1 ) = t=0 q(u t+1 u t 1 ). (5.2) is maximized, yielding the minimum code length or maximum data compression. Clearly, the actual model, generating symbols at the input of the entropy coder, is normally not known and the universal source coding methods are called upon to tackle 41
48 42 CHAPTER 5. CONTEXT MODELING FOR IMAGE COMPRESSION the problem. Yet, to apply universal coding, we still need to specify the set of models Ω = {ω}, which, according to our belief, includes the actual model for our data or a good approximation of it. The set Ω usually is a set of FSM models. A parametric set is specified by the set of model parameters Ω = {P s (A), s S} with the same (known) model structure F(s t, u t 1) [11, 23]. It may also be a set of FSM models with different structure Ω = {F(s t, u t 1 )} and known parameters {P s (A), s S} [11]. A double universal set of models is given by the set of different structures and the set of parameters, see, e.g., [13, 44, 45]. Since the model parameters are rarely known in practice, the most used are the parametric and the double universal sets of models. A double universal set is the most attractive since a wide range of models allows for a better approximation of the message statistics. An example of a double universal set of models is the set of Markov chain models of different order or a set of variable-order Markov chain models (FSMX models) [44, 58]. A more general approach is a set of context tree (CT) models [52] lacking Markov property. These are tree structured FSM models. However, the imposed (tree) structure may not fit well for multi-dimensional data. It was shown that in compression of gray-scale images, a carefully designed parametric set of FSM models usually yields a better performance than a double universal set of FSMX models, see, e.g., [68, 63]. In this chapter, we address the problem of finding an efficient (in the sense of minimizing the code length) parametric set of FSM models. In Chapter 7 we will propose some methods for the double universal coding using a set of specially designed tree structured models. We define two key steps in the search of a good context model: context formation, model optimization. In image compression, the FSM model is usually defined as follows: s t+1 = F(s t, u t 1) = F(u l (t)), (5.3) where l 0 and u l (t) is a substring made up of l symbols drawn from the so far coded message u t 1 (u0 (t) = ), i.e., the state is a function of some finite fixed size (or bounded) subset of the so far coded symbols. The substring u l (t) is called the coding context or simply the context. The context formation is an ad-hoc step and concerns the choice of the context size l (the number of context symbols), the actual symbols, which form the contexts, and the mapping function F( ). It results in a raw model M. The model is defined by the set of possible contexts {u l } and the structure function F(u l ), which specifies the set of the model states S = {s}: F(u l ) : u l A l s S, (5.4) ( S A l ). This problem is discussed in the next section. Model optimization aims at tuning the raw model, acquired in the first step, to the data statistics based on a prior domain knowledge. The prior is assumed to be
49 5.2. CONTEXT FORMATION 43 given as a set of some training data samples. The optimization reduces the model size and defines model parameters (or a prior on parameters). Model optimization is a crucial step in the model design. It is assumed to be carried out off-line. The detailed description of model optimization techniques, based on training statistics of some sample data, is presented in Sections Context formation The first task in context formation is the choice of the context length l. The number of context symbols defines complexity (size) of the raw model. Usually it is specified empirically taking into account the supposed model structure F( ). The structure function in (5.4) can be just the one-to-one mapping: s t+1 = u l (t). (5.5) Then the set of conditional states S is defined by the set of all possible sub-strings: S = {s = u l A l }. This model is similar to an l-order Markov chain model except that the substring may consist of symbols, which are not necessarily the last source letters. Such a model is often used in coding of binary images. However, if the alphabet size is large enough (e.g., m 2), this approach may result in an inferior compression performance. The problem is formulated as follows. Using universal coding, the redundancy (2.23) or (2.24) (the increase of the code length from the possible minimum) caused by the lack of knowledge about the source is proportional to the model size Ω (the number of free parameters). The lower bound on the length of a universal code is given by [44, 52] ϕ(u n 1) nh + Ω 2 log 2 n + O(1), (5.6) which can be achieved by sequential coding using conditional distributions defined by (2.29) (H is the source entropy). The size of the raw context model defined by (5.5) is Ω = (m 1) l+1. (5.7) Thus, by including more symbols in the context, on the one hand, one may better approximate the statistical properties of data and reduce the source entropy H. On the other hand the second term defining the universal coding redundancy (also called the model cost [40]) may have a great impact on the code length for finite length data, if Ω is comparatively large. In the literature, this is also called the context dilution problem [57]. Informally, the more parameters the model has, the longer sequence of past symbols is required for reliable estimation of the source parameters. The model size grows exponentially in the number of context symbols. Including even the most statistically relevant symbols in the context may result in a model of an unacceptable size. Even though further optimization is supposed to handle this problem, too large raw models may reduce efficiency of the optimization due to the
50 44 CHAPTER 5. CONTEXT MODELING FOR IMAGE COMPRESSION lack of the training statistics. In this case, we need to find a model structure that maps the set of all possible contexts A l into a smaller set of conditional states S such that S A l. We shall call it a context pre-quantization. This is an ad-hoc step based on some assumptions on the data to be coded. (If there is enough training data for optimization of this process, this is to be performed in the model optimization.) One approach, which we shall call the texture+energy principle, lead to the most efficient image compression algorithms such as CALIC [68] and ECECOW [64]. This kind of modeling is applied for sources that can be modeled as a Gaussian or Gaussian-like random process (for example, a sequence of coefficients of some orthogonal transform). In this case, the context is divided into a differential part (texture) and an integral part (energy), which may be quantized 1 and combined in a product context model. The texture part of the transform coefficients, for instance, may be defined as a binary vector, were the values are defined by the sign of the context coefficients. The energy may be defined as a (weighted) sum of the coefficients. The next step in context formation is to choose the symbols from the message u t 1 that would reflect the statistical properties of the data in the best possible way (or, equivalently, result in the minimum entropy H). The main principle lying behind the choice of the context symbols is to pick up those ones that have the most statistical influence on the symbol being coded. These are normally the spatially closest symbols. The positions of the context symbols define a context template. For one dimensional data it corresponds to the message suffix of length l. For 2- or more dimensional data, such as images, the suffix alone does not include all the closest symbols. For example, in the raster scanning coding order of transform coefficients c[x, y], 1 x X, 1 y Y, as it is shown on Figure 5.1, the closest 6 symbols are c[y, x 1], c[y, x 2], c[y 1, x 1], c[y 1, x], c[y 1, x + 1], c[y 2, x], where only c[y, x 1] and c[y, x 2] correspond to the immediate past message symbols: u t and u t 1. The others correspond to the symbols u t X, u t X+1, u t X+2, and u t 2X+1, respectively, where X is the image width. The most statistically relevant symbols may be defined in a more complex form, like, e.g., in coding of wavelet transform coefficients, where the context may include information (symbols) form other sub-bands (see, e.g., [64]). 5.3 Context model optimization The raw model M defined in the previous section can be used for source coding, where the conditional coding probabilities calculated by (2.29) drive an arithmetic coder. However, the model can be further optimized based on prior knowledge of the data to be coded. We define two procedures in model optimization: context (parameter) initialization and context quantization. In context initialization, the statistics of the training data is used to specify the model parameters. It can be done in the form of setting fixed 1 This process can be optimized individually for the texture and energy contexts, since there are usually much fewer of them and training data can be efficiently used. Yet, the overall process is rather heuristic.
values or defining a prior distribution on the parameters.

Figure 5.1. Example of a 6-symbol context template for the raster scanning coding order: the context of the current symbol c[y, x] (u_{t+1}) consists of c[y, x−1] (u_t), c[y, x−2] (u_{t−1}), c[y−1, x−1] (u_{t−X}), c[y−1, x] (u_{t−X+1}), c[y−1, x+1] (u_{t−X+2}) and c[y−2, x] (u_{t−2X+1}).

Let us assume an FSM model with the set of states S and the alphabet A. The training data set is denoted {ũ_1^{ñ_s}, s ∈ S} and its statistics is defined by the set of symbol counts {ϑ(a|ũ_1^{ñ_s}), a ∈ A, s ∈ S}. The model parameters can then be specified by the estimates P_s(A) = {p̂(a|s)}, s ∈ S, where

p̂(a|s) = ϑ(a|ũ_1^{ñ_s}) / Σ_{a'∈A} ϑ(a'|ũ_1^{ñ_s}),   (5.8)

(we assume that the training set is large enough to ensure ϑ(a|ũ_1^{ñ_s}) > 0, a ∈ A, s ∈ S). This yields a model with fixed parameters. In case of statistical mismatch, such a model may result in inefficient coding.

A more robust approach is to set a prior distribution on the model parameters. The prior on the parameter vectors {P_s(A)} can be defined in the form of a Dirichlet density [23] specified for every context s ∈ S:

G(P_s(A) | a_s) = [ Γ(Σ_{a∈A} α_a(s)) / Π_{a∈A} Γ(α_a(s)) ] Π_{a∈A} p(a|s)^{α_a(s)−1},   (5.9)

where Γ(·) is the Gamma function and the parameter vectors a_s = {α_a(s), a ∈ A}, s ∈ S, are defined by the statistics of the training set. In the next section, we will describe two methods for finding {a_s}. The values {α_a(s)} correspond to the parameters of the sequential probability estimation rule (2.28) for each context s ∈ S.
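As an illustration, the template of Figure 5.1 can be gathered during a raster scan as follows (a sketch; the array layout, the function name and the omitted treatment of image borders are ours):

    /* Collect the 6-symbol context of Figure 5.1 for the descriptor at (y, x).    */
    /* c holds the (transformed) image descriptors in raster order, X is the       */
    /* image width; borders (y < 2, x < 2 or x = X-1) need separate treatment      */
    /* and are not handled here.                                                    */
    void get_context(const int *c, int X, int y, int x, int ctx[6])
    {
        ctx[0] = c[y * X + (x - 1)];        /* c[y,   x-1]  = u_t        */
        ctx[1] = c[y * X + (x - 2)];        /* c[y,   x-2]  = u_{t-1}    */
        ctx[2] = c[(y - 1) * X + (x - 1)];  /* c[y-1, x-1]  = u_{t-X}    */
        ctx[3] = c[(y - 1) * X + x];        /* c[y-1, x]    = u_{t-X+1}  */
        ctx[4] = c[(y - 1) * X + (x + 1)];  /* c[y-1, x+1]  = u_{t-X+2}  */
        ctx[5] = c[(y - 2) * X + x];        /* c[y-2, x]    = u_{t-2X+1} */
    }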
52 46 CHAPTER 5. CONTEXT MODELING FOR IMAGE COMPRESSION Context quantization is intended for reduction of the number of model states by merging some contexts into one model state if their statistics is similar. Basically, it optimizes the structure function F( ) with respect to some criteria. Different methods and criteria for state merging will be described in Section 5.5. Both context initialization and quantization can be performed independently from each other. However, their combination allows for maximal reduction of the model cost w.r.t. the training data. 5.4 Context initialization Context initialization with fixed values defined by (5.8) is straightforward. In this section we consider a more difficult problem of setting an optimal prior on the model parameters, i.e., finding values of the vectors a s = {α a (s)}, s S. In the following we consider the initialization problem only for memoryless sources. For FSM sources, the initialization is carried out independently for each state (context) regarding it as a separate i.i.d. source. Note, that even though the term context initialization does not make sense for i.i.d. sources (there are no conditional contexts), we will still keep it assuming that the initialization is primarily intended for FSM sources. The use of prior (side) information was considered in [1], where the problem is formulated as follows. Let Ω be a set of parametric memoryless sources with the alphabet A = {a 0, a 1,...,a m 1 }. Let ω Ω be the source that we want to code defined by the parameter vector (probability distribution of the source symbols) P(A) = {p(a), a A} and u n 1 be the corresponding message. Let ω Ω be the source providing side information defined by P(A) = { p(a), a A}. The parameter vectors P(A) and P(A) are not known, but the training message ũñ1 generated by the source ω is available to both encoder and decoder. Furthermore, it is known that the similarity of the sources measured by the pseudo-distance [1] D = a A p(a) 2 p(a) (5.10) satisfies D ζ for some ζ > 0. The choice of the parameters α a in sequential estimation of the coding probability (2.29) α a = α 1 (ñ, ζ)ϑ(a ũñ1) + α, a A, (5.11) where m 1 α 1 (ñ, ζ) = 2ñζ + m 1 and α > 0 is some constant, allows to reduce the code length by (m 1) 2n 2 ln 2 (1 α 1(ñ, ζ)), ϕ(u n 1) ϕ(u n 1 ũñ1) ζ m 1 2, (m 1) 2 4n 2 ln 2 (1 α 1(ñ, ζ)), ζ > m 1 2. (5.12) (5.13)
53 5.4. CONTEXT INITIALIZATION 47 The value ϑ(a ũñ1 ) in (5.11) defines the counts of the symbol a in the message ũñ1. The main drawback of this method is that the parameter ζ is unknown, since we do not know P(A). However, the use of multiple sample messages allows to overcome this problem in the following way. Let us assume a set of training messages {ũñj 1 (j) Ω, j = 1, 2,..., J }. We define p j (a) = ϑ(a ũñj 1 (j)) ñ j (5.14) and Then ζ can be estimated as: p(a) = j J ϑ(a ũñj 1 (j)). (5.15) j J ñj where ζ = max j J {D j} (5.16) D j = a A p j (a) 2 p(a)). (5.17) The use of multiple training samples allows for another approach to initialization based on the minimum (adaptive) code length (MCL) criteria. We will call it the MCL-initialization. Definition 1. Given 0 < α 1 and a set of sample data {ũñj 1 (j), j = 1, 2,...,J}, the MCL-initialization of an m-ary memoryless source is defined by the prior distribution (5.9) with the parameter vector a = {α a, a A} where α a = ϑ init (a) + α, (5.18) The values (initial counts) ϑ init (a) N ( N is the set of nonnegative integers) are chosen to minimize the sum ϕ(ũñj 1 (j), {ϑ init (a)}). (5.19) j J where ϕ(ũñj 1 (j), {ϑinit (a)}) is the code length, corresponding to the message ũñj 1 (j). The parameters (5.18) define the coding probabilities (2.29) in the form q t+1 (a u t 1) = ϑ(a ut 1 ) + ϑinit (a) + α t + a A ϑinit (a) + mα and the corresponding code length is given by (5.20)
54 48 CHAPTER 5. CONTEXT MODELING FOR IMAGE COMPRESSION ϕ(ũñj 1 (j), {ϑ init (a)}) = ( log 2 (Γ ñ j + ) ϑ init (a) + mα Γ ( ϑ init (a) + α )) a A ( ) ( log 2 Γ ϑ init (a) + mα log 2 a A a A a A Γ(ϑ(a ũñj 1 (j)) + ϑinit (a) + mα ). (5.21) The use of multiple training messages is principal for this method, since for a single massage the optimal counts are defined by ϑ init (a) = kϑ(a ũñ1 ), k N, k such that q t+1 (a u t 1) = ϑ(a ũñ1 ) = constant. (5.22) ϑ(a ũñ1) a A In the general case, ϑ init (a) may tend to infinity in some cases, for example, if all the training messages have the same counts for all the symbols: ϑ(a ũñk 1 (k)) = ϑ(a ũñl 1 (l)), a A; k, l {1, 2,..., J }. For this reason, in practical implementation, the search for optimal initialization values is performed in the range [0, N 0 ] [0, N 1 ] [0, N m 1 ] N m, where N i are non-negative integers defining the search range for ϑ init (a i ), respectively. A brute force algorithm requires O ( mj m 1 i=0 N i) time to find optimal values. The restriction on the search range guarantees that the solution will always be found. One can use the monotonicity of the function ϕ(ũñj 1 (j), {ϑinit (a)}) and the fact that optimal counts lie close to the line defined by the ratio (5.15) to find a faster algorithm 2. The proposed MCL-based method does not seem to be appropriate for sources with a large alphabet size (m 2) due to possible high search time. Yet, for binary sources, it is quite feasible. In this case, the search is performed on a rectangle grid defined by [0, N 0 ] [0, N 1 ]. The calculation of the code length (5.21) can be performed using Stirling s approximation [14]: which yields Γ(x) 2πx x 1 2 e x, (5.23) ϕ(ũñj 1 (j), ϑ init 0, ϑ init 1 ) ( ) ( ϑ init 0 + ϑ init 1 + ϑ j 0 + ϑ j log2 2 ϑ init 0 + ϑ init 1 + ϑ j 0 + ϑ j ) +ϑ init 0 log 2 (ϑ init 0 + 1) + 2 ϑinit 1 log 2 (ϑ init 1 + 1) 2 (5.24) ( ϑ init 0 + ϑ init ) log2 (ϑ init 0 + ϑ init 1 + 1) (ϑ init 0 + ϑ j 0 ) log 2(ϑ init 0 + ϑ j ) (ϑinit 1 + ϑ j 1 ) log 2(ϑ init 1 + ϑ j ), 2 Design of an optimal algorithm for this task is out of scope of the thesis.
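For a binary source, the brute-force search described above is straightforward to implement. The sketch below evaluates the adaptive code length of a training message in the closed form corresponding to (5.21), using lgamma() rather than the Stirling approximation (5.24), and scans the grid [0, N_0] × [0, N_1]; the function names are ours and α = 1/2 is assumed.

    #include <math.h>

    #define ALPHA 0.5   /* the constant alpha of (5.18), here 1/2 */

    /* Adaptive code length (in bits) of a binary message with counts n0, n1 when  */
    /* the estimator (5.20) starts from the initial counts t0, t1 (plus ALPHA).    */
    static double code_length(double n0, double n1, double t0, double t1)
    {
        double a0 = t0 + ALPHA, a1 = t1 + ALPHA;
        return (lgamma(n0 + n1 + a0 + a1) - lgamma(a0 + a1)
              - lgamma(n0 + a0) + lgamma(a0)
              - lgamma(n1 + a1) + lgamma(a1)) / log(2.0);
    }

    /* MCL-initialization: find the initial counts in [0,N0] x [0,N1] minimizing   */
    /* the total adaptive code length (5.19) of the J training messages whose      */
    /* zero/one counts are given in cnt0[] and cnt1[].                              */
    void mcl_init(const int *cnt0, const int *cnt1, int J,
                  int N0, int N1, int *best0, int *best1)
    {
        double best = HUGE_VAL;
        for (int t0 = 0; t0 <= N0; t0++) {
            for (int t1 = 0; t1 <= N1; t1++) {
                double total = 0.0;
                for (int j = 0; j < J; j++)
                    total += code_length(cnt0[j], cnt1[j], t0, t1);
                if (total < best) {
                    best = total; *best0 = t0; *best1 = t1;
                }
            }
        }
    }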
55 5.5. CONTEXT QUANTIZATION 49 where for brevity we use the short notation ϑ init a = ϑ init (a), ϑ j a = ϑ(a ũñj 1 (j)), a = 0, 1, and set α = 1 2. For m-ary sources, the MCL-initialization can be significantly simplified by using a binary decomposition technique described in Chapter 3. In this case, the search time becomes of order O(mJ N 0 N 1 ). 5.5 Context quantization Context quantization of an FSM model is formulated as follows. Let S = {s} be a set of raw contexts (states) of some FSM model M with the alphabet A = {a 0, a 1,...,a m 1 }, and S = {s} be the set of states of the output model M. Definition 2. Context quantization Q(S) is a mapping such that S < S, s S: s S s = S, s S s =. Different criteria can be used for context quantization. Q(S) : s S s S (5.25) Definition 3. Given the set of probability distributions P s (A) and the number of output clusters S, optimal minimum conditional entropy context quantization (MCECQ) minimizes the conditional entropy H(A Q(S)) = H(A S). In other words, optimal context quantizer minimizes the distortion of Q: D(Q) = H(A Q(S)) H(A S). (5.26) The MCECQ technique was introduced in [66, 67] 3. It was shown that in the general case, this is a vector quantization problem in the probability space, where the distortion measure is given by information divergence [67]. For a binary alphabet, since its probability simplex is one-dimensional, finding the mapping function Q can be reduced to a scalar quantization problem. In this case, the quantization regions are intervals in the probability space, defined by some set of thresholds {h k, k = 0, 1,..., S } specifying the quantization clusters: s k = {s : p(0 s) (h k 1, h k ]}, (5.27) k = 1, 2,..., S, where p(0 s) is the probability of the zero symbol corresponding to the context s. Optimal quantizer can be found by searching over the set {h k } that minimizes (5.26). The resulting conditional entropy is defined by S H(A 2 Q(S)) = p(s)h(a 2 s), (5.28) where A 2 denotes a binary alphabet and p(s) = Pr(p(0 s) (h k 1, h k ]). 3 In the original papers, the mapping was presented in a more general form. k=1
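Once the thresholds {h_k} have been found, applying the quantizer (5.27) at coding time is a simple lookup; a sketch (the function name is ours):

    /* Map a context with zero-symbol probability p0 to its cluster index k,       */
    /* where the clusters are the intervals (h[k-1], h[k]], k = 1 .. K, see (5.27);*/
    /* h[0] = 0 and the last threshold h[K] are not needed for the lookup.          */
    int quantize_context(double p0, const double *h, int K)
    {
        int k = 1;
        while (k < K && p0 > h[k])
            k++;
        return k;
    }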
56 50 CHAPTER 5. CONTEXT MODELING FOR IMAGE COMPRESSION For the binary case, the optimal quantization problem can be solved via dynamic programing as follows [66]. Let the set of contexts {s} be indexed in order of the corresponding conditional probability such that p(0 s k 1 ) < p(0 s k ), k = 2, 3,..., S, and let {p k, k = 0, 1,..., S }, be the set of values { 0, k = 0, p k = (5.29) p(0 s k ), k = 1, 2,..., S. Let for each pair l, k [0, S ] : l < k, denote s lk = {s l+1,...,s k }, p(s lk ) = Pr(p(0 s) (p l, p k ]), and define the cost function C 1 (l, k] = p(s lk )H(A 2 s lk ). (5.30) Let C z (0, k] be the cost of optimal partial z-cell clustering of the set {s 1, s 2,...,s k }. It can be found using (5.30) by the recursion C z (0, k] = min z<l<k {C z 1(0, l] + C 1 (l, k]}. (5.31) The following procedure [66] finds the set of thresholds {h 0, h 1,...,h S } defining optimal clustering (5.27) that corresponds to C S (0, S ]: /* initialization */ for (l = 0; l < S 1; l + + { for (k = l + 1; k S ; k + +) { C 1 (l, k] := p(s lk )H(A 2 s lk ); } } /* find C S (0, S ]*/ for (z = 2; z S ; z + +) { for (k = z; k S ; k + +) { } } C z (0, k] := min l l z (0, k] := arg min l /* extract thresholds {h k } */ h 0 := 0; h S := p S ; k := S ; for (z = S ; k 2; k ) { k := l z (0, k]; h z 1 := p k ; } output {h 0, h 1,...,h S }. {C z 1 (0, l] + C 1 (l, k]}; {C z 1 (0, l] + C 1 (l, k]};
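The procedure above can be condensed into a compact routine. The sketch below assumes the contexts are already sorted by p̂(0|s) and represented by their training counts; cost1() plays the role of C_1(l, k] in (5.30), computed from counts rather than probabilities (which only rescales the objective by the constant total count), and all names are ours. Prefix sums over the counts would remove one factor of S from the running time.

    #include <math.h>
    #include <stdlib.h>

    /* Cost of merging the contexts l+1 .. k into one cluster; n0[s], n1[s] are    */
    /* the training counts of the contexts sorted by increasing p_hat(0|s).        */
    static double cost1(const double *n0, const double *n1, int l, int k)
    {
        double c0 = 0.0, c1 = 0.0;
        for (int s = l + 1; s <= k; s++) { c0 += n0[s]; c1 += n1[s]; }
        double n = c0 + c1, h = 0.0;
        if (c0 > 0.0) h -= c0 * log2(c0 / n);
        if (c1 > 0.0) h -= c1 * log2(c1 / n);
        return h;
    }

    /* MCECQ by dynamic programming: partition the S sorted contexts (1 .. S)      */
    /* into K clusters minimizing the conditional entropy; thr[z] receives the     */
    /* index of the last context of cluster z.                                     */
    void mcecq(const double *n0, const double *n1, int S, int K, int *thr)
    {
        double *C    = malloc((K + 1) * (S + 1) * sizeof(double));
        int    *back = malloc((K + 1) * (S + 1) * sizeof(int));
        #define AT(z, k) ((z) * (S + 1) + (k))

        for (int k = 1; k <= S; k++)
            C[AT(1, k)] = cost1(n0, n1, 0, k);
        for (int z = 2; z <= K; z++) {
            for (int k = z; k <= S; k++) {
                double best = HUGE_VAL; int arg = z - 1;
                for (int l = z - 1; l < k; l++) {
                    double c = C[AT(z - 1, l)] + cost1(n0, n1, l, k);
                    if (c < best) { best = c; arg = l; }
                }
                C[AT(z, k)] = best; back[AT(z, k)] = arg;
            }
        }
        /* trace back the cluster boundaries */
        int k = S;
        for (int z = K; z >= 1; z--) { thr[z] = k; k = (z > 1) ? back[AT(z, k)] : 0; }

        free(C); free(back);
    }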
57 5.5. CONTEXT QUANTIZATION 51 The use of context quantization assumes that the probabilities {p(0 s), s S} are not known (otherwise, it would be better, w.r.t. compression, to use them without quantization of the contexts). In the off-line quantization, the statistics of some set of training messages {ũñs 1 (s), s = 1, 2,..., S} can be used. In this case, the optimization is performed based on the probability estimates ˆp(0 s) = ϑ(0 ũñs 1 (s)) ñ s. (5.32) The estimates (5.25) can be also used for complete specification of the model. The model can be exploited for encoding other sources with the statistics similar to that of the training set. Another approach to context quantization based on the minimum adaptive code length criterion for the training data {ũñs 1 (s), s = 1, 2,..., S} was proposed in [15]. Let ũñs 1 (s) = s s ũñs 1 (s) (5.33) be a sequence of length ñ s = s s ñs containing all the strings {ũñs 1 (s), s s}. Let ϕ(s) = ϕ(ũñs 1 (s)) be the code length corresponding to the grouped state s. Definition 4. Given a set of training messages {ũñs 1 (s), s = 1, 2,..., S}, the minimum adaptive code length context quantization (MCLCQ) is defined by mapping (5.25) that minimizes the total code length for the training set: ( ) ϕ s S ũñs 1 = ϕ(s). (5.34) s S The use of sequential coding probability estimation rule ϑ(a ũ t s s q t s+1(a ũ t 1 (s)) = 1 (s)) + α a (s) t s + a A α a(s), (5.35) t s = 0, 1,..., ñ s 1, allows to calculate the (adaptive) code length ϕ(s) for each grouped context s. The code length does not depend on the order of appearance of symbols in the string ũñs 1 (s) and it is uniquely defined by the symbol counts ϑ(a s) = s s ϑ(a ũñs 1 ), a A: ( ) Γ α a (s) ϕ(s) = log 2 a A ( Γ ñ s + ) Γ(ϑ(a s) + α a (s)) a A, (5.36) α a (s) Γ(α a (s)) a A a A where {α a (s), a A} define a prior on the probability distribution P s (A) corresponding to the context s (see Section 5.4). Thus, MCLCQ allows to combine context initialization and quantization yielding a model, which minimizes the total adaptive code length for the training set.
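The cost (5.36) itself is a one-liner with the log-Gamma function; a sketch for an m-ary alphabet (the function name is ours):

    #include <math.h>

    /* Adaptive code length (5.36), in bits, of a grouped context with symbol      */
    /* counts cnt[a] and prior parameters alpha[a], a = 0 .. m-1.                  */
    double group_code_length(const double *cnt, const double *alpha, int m)
    {
        double asum = 0.0, nsum = 0.0, bits = 0.0;
        for (int a = 0; a < m; a++) {
            asum += alpha[a];
            nsum += cnt[a];
            bits -= lgamma(cnt[a] + alpha[a]) - lgamma(alpha[a]);
        }
        bits += lgamma(nsum + asum) - lgamma(asum);
        return bits / log(2.0);
    }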
58 52 CHAPTER 5. CONTEXT MODELING FOR IMAGE COMPRESSION Definition 5. A minimum adaptive code length optimized model is given by the set of states S and the set of the corresponding parameter initialization values ϑ init a (s), s S. The set S is obtained by MCL quantization where the parameters α a (s) are defined by the MCL-initialization values: for some α (0, 1]. α a (s) = ϑ init a (s) + α, (5.37) The MCL-optimized model achieves the minimal code length for the training data using sequential coding probability estimation (2.28). For binary sources, MCLCQ can be solved via dynamic programing in the probability space analogously to that of MCECQ. However, the solution is not necessarily strictly optimal. Let the states {s} be indexed according to their probability estimates ˆp(0 s) = ϑ(0 ũñs 1 (s)) + α, (5.38) ñ s + 2α where α > 0 (a good choice is α = 1), such that ˆp(0 s 2 l) < ˆp(0 s k ), k > l. The set of values {ˆp k, k = 0, 1,..., S } is defined by (5.29) using (5.38). Let for each l, k [0, S ] : l < k, denote ϑ(0 s lk ) = s s ϑ(0 ũñs lk 1 (s)). Then the cost function is defined as C 1 (l, k] = ϕ(s lk ) = log 2 Γ(α 0 (l, k) + α 1 (l, k))γ(ϑ(0 s lk ) + α 0 (l, k))γ(ϑ(1 s lk ) + α 1 (l, k)) Γ(α 0 (l, k))γ(α 1 (l, k))γ(ϑ(0 s lk ) + ϑ(1 s lk ) + α 0 (l, k) + α 1 (l, k)) (5.39) where α 0 (l, k) and α 1 (l, k) reflect the prior on the probability p(0 s lk ) and can be calculated using the techniques described in Section 5.4. Fast calculation of (5.39) can be performed using Stirling s approximation (5.23). If multiple training data is used for each context, as required, e.g., for the MCL initialization, then the training set is denoted {ũñjs 1 (j, s); j = 1, 2,..., J ; s = 1, 2,..., S } (for each context s there are J training samples), and the cost function is defined by C 1 (l, k] = ϕ(s lk ) = j J log 2 Γ(α 0 (l, k) + α 1 (l, k))γ(ϑ(0 s j lk ) + α 0(l, k))γ(ϑ(1 s j lk ) + α 1(l, k)) Γ(α 0 (l, k))γ(α 1 (l, k))γ(ϑ(0 s j lk ) + ϑ(1 sj lk ) + α 0(l, k) + α 1 (l, k)) (5.40) where ϑ(a s j lk ) = s s ϑ(a ũñjs lk 1 (j, s)), a = 0, 1. One more benefit of MCLCQ over MCECQ is that it allows to determine the size of the output model automatically, taking the smallest S from those, which yield the shortest code length: { } S = min arg min {C z(0, S ]} (5.41) z=2,3,..., S
59 5.6. SUMMARY 53 (in general, there may be several values z resulting in the same minimal code length). We rewrite the procedure for MCECQ to implement MCLCQ with context initialization: /* initialization */ for (l = 0; l < S ; l + +) { for (k = l + 1; k S ; k + +) { find(α 0 (l, k), α 1 (l, k)); C 1 (l, k] := ϕ(s lk ) ; /* defined by (5.39) or (5.40) */ } } /* find S and C S (0, S ] */ S := 2; for (z = 2; z < S ; z + +) { for (k = z; k S ; k + +) { } C z (0, k] := min l l z (0, k] := arg min l {C z 1 (0, l] + C 1 (l, k]}; {C z 1 (0, l] + C 1 (l, k]}; } if (C z (0, S ] < C z 1 (0, S ]) S = z; /* extract thresholds {h k } */ h 0 := 0; h S := ˆp S ; k := S ; for (z = S ; z 2; z ) { k := l z (0, k]; h z 1 := ˆp k ; } output {h 0, h 1,..., h S }. This procedure finds an MCL-optimized model for the training set {ũñjs 1 (j, s); j = 1, 2,..., J ; s = 1, 2,..., S }, if α 0 (l, k) and α 1 (l, k)) are defined by the MCL criterion. 5.6 Summary In this chapter we addressed the problem of context modeling, which is one of the main challenges in the development of an efficient compression algorithm. The key step in the model design is the optimization process based on domain knowledge. This is done in the form of setting prior on the model parameters and reduction of the model size by merging states with similar statistics. The described optimization techniques are based on some sample data set. The use of the minimum adaptive code length
60 54 CHAPTER 5. CONTEXT MODELING FOR IMAGE COMPRESSION criterion allows to design a model, which yields the minimum possible code length for the training data set. The main problem of using optimized models is a possible statistical mismatch between the model and the data being coded. It may result in the increase of the code length compared to the non-optimized model. Thus, the optimization is most effective for specialized coding systems, such as, e.g., medical image compression algorithms. Optimization is a complex task, if the source alphabet is larger than 2. For binary sources, however, both initialization and quantization can be easily implemented. This suggests that the use of binarization techniques described in Chapter 3 allows for efficient model optimization for m-ary sources, m > 2. An open question for MCL context initialization is what should be the values N i defining the search range. A reasonable choice seems to be N i = max{ j J ϑj i, 255}, i = 0, 1, assuming the use of conventional implementation of an adaptive arithmetic coder with count scaling [61, 29]. Besides, the choice of a larger N i, i = 0, 1, increases the search time and may reduce the benefit of initialization, if the statistics of the message being coded is far from that of the sample set. The MCL-based optimization methods use the code length as a cost function. If the model is used in combination with a conventional adaptive arithmetic coding driven by the coding probabilities (2.28), (2.29), then the code length can be simply calculated based on symbol counts without performing actual encoding. However, standard implementation of an arithmetic coder, e.g., [61], may be too slow in some applications due to involved time consuming multiplication and division operations. Many multiplication free arithmetic coding algorithms have been proposed to speed up encoding process ([34, 21, 24]), which do not necessarily use symbol counts to produce the code. For example, the widely used QM- and MQ- binary arithmetic coders [33, 56] are implemented as a finite state machines, where the probability is defined by a state of the coder. In this case, the code length can only be determined by actual coding and the context model initialization can be performed by choosing an initial state, which yields the minimum code length for the training set. The described MCL model optimization has a tight relation with the MDL principle [6]. Yet, it is different in the sense, that in universal coding using the MDL principle, the choice is made during encoding and the model description is transmitted together with the corresponding code. In source coding with the MCL-optimization, the choice is made in advance, before the encoding, based on training data samples. Thus description of the model is avoided, since both the encoder and decoder know the model. This allows for more efficient compression. On the other hand, it may also reduce efficiency due to the possible statistical mismatch.
61 Chapter 6 Optimization of context models in the JPEG2000 standard The JPEG2000 image compression standard exploits two empirically designed context models for conditional entropy coding of bit-planes of wavelet transform coefficients. The models use the so called significance (binary) values of eight adjacent coefficients as a context template, which are mapped into 9 conditional contexts. This chapter addresses the problem of optimality of this approach. In other words, we answer the question: given the context template, is it possible to design models that would result in a better compression performance? For this purpose, we exploit optimization techniques for model design presented in Chapter Introduction It has been shown that JPEG2000, the recently adopted standard for image compression, among other advantages achieves high compression performance. This is obtained by utilizing a context-adaptive bit-plane embedded coding of wavelet transform coefficients [56]. In JPEG2000, the coefficients are coded in multiple passes, one bit plane per pass, starting from the most significant bit to the least significant bit in their binary representation. In each pass, the coder usually uses three coding primitives [56]: the significant bit coding, the sign bit coding and the refinement coding. The first primitive is invoked to code the coefficients, which have zero bits in the previous bit planes (insignificant coefficients). If the coded bit is 1, the coefficient becomes significant. The sign bit coding primitive is invoked right after the coding of the significant bit, if the bit is 1. The refinement coding primitive is used to code the bits of those coefficients, which are already significant. The significant bit coding primitive performs the main part of the compression. It was shown in [64] that efficient compression is determined mainly by entropy coding, rather than using good wavelets. Therefore, careful statistical context modeling for conditional entropy coding of significant bits is a task of primary importance. The JPEG2000 standard adopts two heuristically designed context models from the EBCOT algorithm [55]. The models use information of significance of eight neigh- 55
boring coefficients to define 9 conditional states. Using a small number of conditional states, the context dilution problem is avoided. On the other hand, the models may miss some statistical dependencies, leading to inferior compression. This chapter has two main goals inspired by the model optimization methods presented in Chapter 5. The first goal is to investigate the compression performance of the heuristically designed models compared to optimized ones. The second goal is to use optimized high-order texture+energy models to achieve better compression.

Figure 6.1: The structure of a one-dimensional decomposition: the input sequence v = {v[x], x = 1, 2, ..., X} is passed through a low-pass and a high-pass filter, each followed by decimation by 2, producing the L and H sub-bands of the transform coefficients c = {c[x], x = 1, 2, ..., X}.

6.2 Context modeling in JPEG2000

JPEG2000 uses a dyadic wavelet transform, which can be viewed as a decomposition into sub-bands via low- and high-pass filtering. A one-dimensional decomposition of a sequence of samples v = {v[x], x = 1, 2, ..., X} is schematically represented in Figure 6.1, where L and H denote the low- and high-pass subsets of the transform coefficients c = {c[x], x = 1, 2, ..., X}, respectively. A 2-D decomposition is obtained by separate filtering of image rows and columns. The resulting structure of the transform coefficients of a one-level decomposition of an image consists of four sub-bands denoted LL, LH, HL, and HH, as shown in Figure 6.2(a). This process is usually repeated for the LL sub-band, resulting in a multi-level decomposition. Let Z denote the number of wavelet decomposition levels, where the first decomposition gives the largest-size sub-bands LL_1, LH_1, HL_1 and HH_1. The second-level decomposition of the LL_1 sub-band gives the sub-bands LL_2, LH_2, HL_2 and HH_2, and so on. An example of a three-level decomposition is depicted in Figure 6.2(b).

Let c be an r-bit wavelet coefficient, let c_i be the i-th bit of the binary representation of c (i = 0, 1, ..., r−1, where i = 0 corresponds to the least significant bit), and let c_{...i} denote the r-bit value which has the same bits at the positions i, i+1, i+2, ..., r−1 as the coefficient c and zeros at the positions 0, 1, ..., i−1. The significance value of the coefficient c in bit-plane i is defined as
σ_i(c) = 1, if c_{...i} > 0, and σ_i(c) = 0 otherwise.

The coefficient c is called significant in bit-plane i, if σ_i(c) = 1, or insignificant, if σ_i(c) = 0. According to this notation, in the coding pass of the i-th bit-plane, the significance coding primitive is invoked if σ_i(c) = 0. The sign coding primitive is invoked right after significance coding if the coefficient becomes significant, i.e., if c_i = 1. The refinement coding primitive is invoked if σ_i(c) = 1.

Figure 6.2: 2-D wavelet transform sub-bands: (a) one-level decomposition into the LL, LH, HL and HH sub-bands; (b) three-level decomposition into the sub-bands LL_3, LH_3, HL_3, HH_3, LH_2, HL_2, HH_2, LH_1, HL_1 and HH_1.

The JPEG2000 raw context model is defined by the significance values of the eight coefficients adjacent to the one being coded. It has 256 states. The corresponding context template is shown in Figure 6.3. The raw states are mapped into 9 conditional contexts s ∈ S = {0, 1, ..., 8}:

s = Q{σ_i(c_w), σ_i(c_n), σ_i(c_nw), σ_i(c_ne), σ_{i+1}(c_e), σ_{i+1}(c_s), σ_{i+1}(c_sw), σ_{i+1}(c_se)},   (6.1)

where Q{·} is the mapping function (a heuristic context quantizer). The indexes denote (with respect to the scanning order^1) the west, north, north-west, north-east, east, south, south-west and south-east neighboring coefficients, respectively. Two functions Q{·} are implemented for different sub-bands, resulting in two context models: one model for the LL, HL and LH sub-bands and another one for the HH sub-band. A detailed description of these functions is given in [56, 55]. However, we should note that these functions were found heuristically. In Chapter 5, techniques for model optimization based on context initialization and quantization have been described. We apply these techniques to optimize the mapping (6.1) and to design high-order models using energy information.

1 The scanning order for the sub-bands is the same as in the JPEG2000 standard.

The statistical properties of the transform coefficients vary for different sub-bands. To make optimization more efficient, we divide the bit-planes and the sub-bands into
7 and 4 classes, respectively, and essentially design a set of 7 × 4 = 28 models M = {1, 2, ..., 28}. The classes are defined as follows. The bit-planes are enumerated in such a way that 0 corresponds to the least significant bit-plane and N corresponds to the most significant one within a sub-band. (In other words, N + 1 corresponds to the minimal number of bits required for a binary representation of the absolute values of all the coefficients in a sub-band. N may vary for different sub-bands.) The bit-plane classes 0...3 correspond to the bit-planes 0...3. Class 4 covers the bit-planes 4...N−2. Classes 5 and 6 include the (N−1)-th and N-th bit-planes, respectively. The sub-band classes are defined for a Z-level wavelet decomposition as follows:

1: the LH_{Z...2} and HL_{Z...2} sub-bands;
2: the HH_{Z...2} sub-bands;
3: the LH_1 and HL_1 sub-bands;
4: the HH_1 sub-band.

We did not optimize the model for coding the LL_Z band, since its contribution to the code length is negligible.

Figure 6.3: Context template used in JPEG2000: the significance values σ_i(c_nw), σ_i(c_n), σ_i(c_ne), σ_i(c_w) and σ_{i+1}(c_e), σ_{i+1}(c_s), σ_{i+1}(c_sw), σ_{i+1}(c_se) surrounding the bit c_i being coded.

6.3 High-order context modeling

The JPEG2000 raw model uses only the significance values of eight neighboring coefficients. This model basically captures texture statistics. However, the best compression
65 6.3. HIGH-ORDER CONTEXT MODELING 59 results were obtained by using combined texture + energy context models 2, see, e.g., [64]. We construct high-order models by adding the energy information estimated from the magnitude of the coefficients to the JPEG2000 raw model (the texture part). Direct use of the coefficients magnitude is not appropriate. Even though the coefficients are available in the quantized form (due to coding by bit-planes), the number of possible contexts normally is far too large (especially for the bit-planes, which are close to the least significant one). This may cause the context dilution problem, i.e., reduce the compression performance due to the high model cost 3. Thus, we need a more efficient approach to context formation. The main observation, which we used, is that the influence of the energy on the probability estimates strongly depends on the texture context: some texture patterns occur so rare that it is not possible to collect enough statistics to use a higher order model efficiently. Thus, for contexts with such a texture, we simply omit information about the energy. Furthermore, this information varies for different bit-planes, e.g., at bit-planes, which are close to the most significant one, very little is known about the magnitude of the neighbors. From these considerations we form the high-order contexts as follows. For the bit-plane classes 5 and 6 no energy information is included in the context. For the bit-plane class 6, the context is represented by a binary vector of the significance values of four causal coefficients: σ i (c w ), σ i (c n ), σ i (c nw ) and σ i (c ne ), and for the bit plane class 5, the context is a vector of the significance values of all eight neighbors: σ i (c w ), σ i (c n ), σ i (c nw ), σ i (c ne ), σ i+1 (c s ), σ i+1 (c e ), σ i+1 (c sw ) and σ i+1 (c se ), where i is the bit plane being coded. For the bit-plane classes 0...4, energy information is added to the texture pattern to form a compound context, as will be specified in the following paragraphs. The texture class T is defined as the number of significant coefficients among the neighbors. To define the energy vector E i = {e i 1, ei 2,...,ei T }, where i = {0, 1,..., 6} specifies the bit-plane class, let b(c) be the bit-plane at which the coefficient c has become significant. Then for each significant coefficient from the context template, the values e i k of the energy vector are estimated in the form of a quantized coefficient magnitude as follows: { b(c) i ν, if b(c) i ν < e i Mi (k), k = M i (k), if b(c) i ν M i (k), k = 1, 2,..., T, where ν = { 0, for cw, c n, c nw and c ne, 1, for c e, c s, c sw and c se, 2 The number of significant coefficients in the context template can be viewed as a rough estimate of the local energy. Yet, by energy we assume more fine information defined further in the section. 3 Even though the optimization is supposed to handle this problem, the lack of contexts statistics may reduce its efficiency.
and M_i(T) depends on the bit-plane class i and the texture class T. The values M_i(T), T = 0, 1, ..., 8, are listed in Table 6.1, where "no" means that no energy information is used for this texture class. The values e^i_k basically define the position of the first nonzero bit (starting from the N-th bit-plane) relative to the bit-plane i in the binary representation of the coefficient. The introduction of ν is justified by the fact that, in the raster scan coding order, the bits corresponding to the current bit-plane are not yet available for the coefficients c_e, c_s, c_sw and c_se.

Table 6.1: Magnitude quantization parameter M_i(T) for different texture and bit-plane classes.

The resulting raw model is formed as a product of all possible texture patterns (256 in total) and the corresponding energy vectors. Even though its design is somewhat ad hoc, the model is able to capture high-order statistics. The number of states of the raw model for a given bit-plane class i can be calculated as

\[
S_i = \sum_{T=0}^{8} \binom{8}{T}\,\bigl(M_i(T)+1\bigr)^{T}. \tag{6.2}
\]

The corresponding sizes are given in Table 6.2.

Table 6.2: Sizes of the high-order raw models.

The high-order raw models were optimized using the techniques described in Chapter 5, including both context initialization and quantization. Doing this, we assume the use of the conventional arithmetic coder that exploits the coding probabilities (2.28) (whereas JPEG2000 exploits the MQ-coder [56]).
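To make the context formation above concrete, the following C sketch computes one component of the energy vector and the texture part of the context. It is only an illustration of the rule above under stated assumptions: the function names and the flat bit-packing of the texture pattern are our own choices, not part of the thesis implementation or of JPEG2000.

    /* One component of the energy vector for a significant neighbour c
     * (illustration of the quantized-magnitude rule above):
     *   b  - bit-plane b(c) at which the neighbour became significant
     *   i  - bit-plane currently being coded
     *   nu - 0 for the causal neighbours c_w, c_n, c_nw, c_ne, 1 otherwise
     *   M  - clipping threshold M_i(T) taken from Table 6.1              */
    static int energy_component(int b, int i, int nu, int M)
    {
        int e = b - i - nu;
        if (e < 0)
            e = 0;               /* guard: significant neighbours satisfy b >= i + nu */
        return (e < M) ? e : M;  /* clip at M_i(T) */
    }

    /* Texture part of the context: pack the eight significance flags into
     * one byte and count the significant neighbours (the texture class T). */
    static int texture_context(const int sigma[8], int *T)
    {
        int ctx = 0, count = 0;
        for (int k = 0; k < 8; k++) {
            ctx = (ctx << 1) | (sigma[k] != 0);
            count += (sigma[k] != 0);
        }
        *T = count;
        return ctx;
    }

The compound context is then the pair formed by the texture pattern and the energy vector of its significant neighbours.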
67 6.4. EXPERIMENTAL RESULTS 61 Image Heuristic Optimized models models 9 states 9 states 64 states optimal no init. no init. baloon barb barb board boats girl gold hotel zelda lena baboon tools Average bit rates (+0.06%) (-0.16%) (-0.22%) (-0.28%) (-0.28%) Average bit rates for the training set (+0.07%) (-0.19%) (-0.27%) (-0.37%) (-0.38%) Table 6.3. Code lengths in bytes and average bit rates in bits per pixel (in the lossless mode) for the significant bits for the heuristic JPEG2000 models and the models of different size obtained by optimizing the JPEG2000 raw model (256 raw contexts) for the set of test images and the training set. 6.4 Experimental results The models obtained by the optimized JPEG2000 raw model (Section 6.2) and the optimized high-order models (Section 6.3) were tested on 12 standard test images to verify its efficiency. We compared their compression performance with that of the models used in the JPEG2000 standard. To make the comparison fair, the JPEG2000 models and the optimized models were used in the same settings: the same dyadic reversible wavelet transform using the 5/3 filter [56], the same arithmetic coder from [61] and the same (raster) scanning order of the wavelet transform sub-bands. For simplicity, we did not use block-wise coding implemented in JPEG2000, rather we separated coding passes by the sub-bands. The training set used for optimization consisted of 60 natural 8-bit images. Images used for test compression were chosen from outside the training set. Tables 6.3 and 6.4 show the results obtained by optimizing the JPEG2000 raw model and the highorder models, respectively. The number of states defines the maximal model size for optimized models. Optimal means that the size of the models is optimal for the
68 62 CHAPTER 6. OPTIMIZATION IN THE JPEG2000 STANDARD Image Heuristic Optimized high-order models models 9 states 64 states optimal baloon barb barb board boats girl gold hotel zelda lena baboon tools Average bit rates (-0.31%) (-0.363%) (-0.34%) Average bit rates for the training set (-0.47%) (-0.59%) (-0.63%) Table 6.4. Code lengths in bytes and average bit rates in bits per pixel (in the lossless mode) for the significant bits for the initialized heuristic JPEG2000 models and the high-order optimized models of different size for the set of test images and the training set. training set (the sizes vary, yet optimal, for different sub-band and bit-plane classes). Table 6.3 contains also the results obtained with the (optimally) initialized and noninitialized (the no init. column) JPEG2000 models. Figures in the tables are the number of bytes and average bit rates spent for coding significant bits in the lossless mode. The reader can see that even though the optimized models consistently outperform the JPEG2000 models, the improvement is only marginal. The main reason is that the coding is performed mainly in contexts, where most of the neighbors (> 4...5) are insignificant. In this case, conditioning by the coefficients magnitude can not significantly improve the performance. To achieve better compression, one has to enlarge the context template to be able to use more information for conditional probability estimates. 6.5 Summary The conclusion is that the JPEG2000 models are almost optimal for the given context template over a broad class of natural images. The optimized models result in a
very small improvement (about 0.3%) over the heuristic models for the class of natural images. The use of energy information in addition to the texture, defined by the significance of the neighboring coefficients, improves the coding efficiency only marginally. More efficient models can be designed only by taking a larger context template.
Chapter 7

Hierarchical modeling via optimal context quantization

In this chapter we propose methods for double adaptive context modeling based on the context quantization technique described in Chapter 5.

7.1 Introduction

So far we have considered the problem of designing a parametric set of models Ω for universal coding of images (Chapter 5). However, a double adaptive set of models allows for a better approximation of the real source due to a wider range of the set Ω. Given a parametric FSM model M of size S, the code length of a message of length n is lower bounded by

\[
\varphi(u_1^n) \geq nH_M + \frac{K}{2}\log_2 n + O(1), \tag{7.1}
\]

where H_M is the per-symbol entropy of the source defined by the model M and K = (m − 1)S is the number of free parameters. It is clear from this expression that the optimal model size (complexity) may vary for messages of different lengths. The use of a double adaptive set of models of different complexity may overcome this problem. On the other hand, multi-model coding is more complicated, since it usually requires more memory space and coding time. Furthermore, the code length of a double adaptive coding includes the multi-model redundancy term, denoted R(Ω), such that the code length becomes [45, 13]

\[
\varphi(u_1^n) \geq R(\Omega) + nH_M + \frac{K}{2}\log_2 n + O(1). \tag{7.2}
\]

Informally, R(Ω) can be understood as the additional information required to specify the particular model structure from the set, while the term (K/2) log_2 n reflects the lack of knowledge of the parameters, given the structure. If the set of structures is reasonably small, then this approach may achieve a smaller overall code length and, thus, a better compression performance.
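To get a feeling for the magnitude of the parameter cost in (7.1), consider a rough illustrative example (not one of the thesis experiments): a binary source (m = 2) coded with an FSM model of S = 256 states and a message of n = 512 × 512 = 262144 symbols gives K = (m − 1)S = 256 and (K/2) log_2 n = 128 · 18 = 2304 bits, i.e., roughly 288 bytes of overhead on top of nH_M. Doubling the number of states doubles this overhead, which pays off only if the extra states lower H_M enough to compensate.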
72 66 CHAPTER 7. HIERARCHICAL MODELING The drawback of conventional tree structured sets of models, such as the sets of FSMX or CT models, is that the tree structure corresponds to the lexicographical order of source symbols constituting the state. Specifically, it assumes that strings with the common suffix of some length l 0, have close probability distributions, and the longer the suffix, the closer the distributions. This enables to merge such states and use common statistics, and hence, reduce the model cost. It seems to be natural for text-like sources or one dimensional data. Yet in general, for multi-dimensional data, this assumption is not valid. In such cases, the context tree may have many leaves with similar statistics, which are not siblings and even can be in distant branches of the tree. Especially this is true for digital images. It is usually very difficult to find a rule or a structure for context merging for multi-dimensional data. The design of a set of tree-structured models also raises the question of symbol ordering in the context. This problem has been discussed in [57], where the authors used an ad-hoc approach to arrange the context symbols. Instead of heuristic ordering, context quantization allows for an efficient grouping, which is not restricted by any structure. Optimal merging is based on the statistics from some training set. 7.2 Two-stage quantization Context quantization introduced in Chapter 5 allows to find a parametric FSM model with fixed complexity. The MCL criterion also allows to get the model of optimal size for the set of training data samples. However, the quantization procedure allows to extract a model of arbitrary size, which is optimal with respect to this size for the training data. We use this property to design a double adaptive model set. This is achieved by performing two-stage quantization: the first one is a normal MCECQ or MCLCQ described in Section 5.5, where S is chosen comparatively large (but smaller than the number of raw contexts before quantization S ) and the second quantization is performed based on the statistics of the data in hand. The pre-knowledge given by the training set allows for the first significant reduction of the model complexity. Once we have primary partition (quantization) of the contexts, the model is further reduced individually for the given data using its statistics. It has two main goals: finding a model of an optimal complexity (the number of states) and adjusting the quantizer to the given data. In other words, the second quantization can also correct possible statistical mismatch between the training set and the input data. The secondary quantization can be done in the two-pass coding or the right model can be estimated sequentially in a predictive way. In the first case, the model description R(Ω) must be sent to the decoder as side information. Since the number of states after the first quantization is comparatively small, this information is expected to be negligible. There are several ways to perform the secondary quantization. One can use the
same technique for optimal context quantization based on the statistics of the data in hand, using the states obtained after the first quantization as an input to the second quantization. This would give an optimal solution.

For binary sources, an efficient but suboptimal solution is possible using a tree-structured quantization. This approach exploits the fact that in context quantization the states are ordered by their probability distributions. Since it seems to be less complex to implement and allows for efficient quantization algorithms, we shall describe the tree-structured quantization in the next sections and present some experimental results.

7.3 Tree-structured quantization

Given the training data, context quantization algorithms yield an optimal clustering of contexts with the number of clusters M = S. This number is set in advance and defines the size of the FSM model M_M. Let T_M be a binary tree with M leaves. We define the order of the leaves according to the lexicographical order of their binary paths from the root to the leaf. Let each leaf of the tree correspond to a state of the model M_M in such a way that the order of the states corresponds to the order of the leaves. Since the states are ordered by their corresponding probability distributions P_s(A), this results in a tree-structured scalar quantizer of the contexts in the probability space. Thus, any pruned subtree τ of T_M defines some secondary clustering (quantization) of the initial context partition M_M and defines a new model (normally with a smaller number of states). The number of pruned trees is doubly exponential in M. This gives a large number of models, whereas the storage requirements and the time needed to find the optimal tree are linear in M.

In this way we have reverted to a tree-structured set of models. However, now the leaves of the tree are ordered by the corresponding probability distributions. Hence, it is now more natural to merge siblings whose statistics are similar. The tree-structured quantizer has some advantages and drawbacks. The main advantage is that the tree offers efficient algorithms for optimal pruning. The drawback is that this solution is suboptimal.

7.4 Optimal tree pruning

Given a message of length n, optimal pruning is based on the statistics of the data to be coded and is performed using the minimum adaptive code length criterion (cost function) in a two-pass manner. During pre-coding, the symbol counts are collected for each node of the tree. Based on the counts, the adaptive code length for the nodes can be calculated using the method described in [27]. Let s be a node of the tree T_M, and let s0 and s1 be the sons of this node¹.

¹ We use the same symbol to denote a node of the tree and a state of an FSM model, as in Section 7.3, since the tree is just a restricted version of an FSM model, where the set of nodes defines the set of states of the corresponding FSM model.
For each node s of the tree which is not a leaf, its cost function is recursively calculated as

\[
C(s) = \min\{\,C'(s),\; C(s0) + C(s1)\,\},
\]

and for the leaves

\[
C(s) = C'(s),
\]

where C'(s) is the adaptive code length calculated from the counts at the node s, and C(s0) and C(s1) are the cost functions of the sons. Then the tree is traversed from the root, and the subtree below the node s is pruned if C'(s) ≤ C(s0) + C(s1). The resulting tree is used to code the data. It can easily be verified that this pruning algorithm results in a tree whose leaves (the resulting clustering) yield the minimum code length for the given data. The search time for the optimal pruned tree is linear in the size of the full tree; thus, it is a time-efficient algorithm. The tree description must be transmitted to the decoder as side information. The description length is proportional to the size of the found model and is normally negligible compared to the length of the coded data.
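The pruning recursion can be summarized by the following C sketch. The Node structure, the function names and the way the per-node adaptive code length C'(s) is obtained from the collected counts (e.g., with the method of [27]) are assumptions of this illustration rather than the thesis implementation.

    #include <stddef.h>

    /* Node of the quantizer tree T_M.  cost_self holds the adaptive code
     * length C'(s) computed from the symbol counts collected at this node
     * during the pre-coding pass; left/right are the sons s0 and s1
     * (both NULL for a leaf).                                             */
    typedef struct Node {
        double       cost_self;        /* C'(s) */
        struct Node *left, *right;
    } Node;

    /* Returns C(s) = min{ C'(s), C(s0) + C(s1) } and prunes the subtree
     * whenever keeping s as a leaf is at least as cheap.                  */
    static double prune(Node *s)
    {
        if (s->left == NULL)           /* leaf: C(s) = C'(s) */
            return s->cost_self;

        double children = prune(s->left) + prune(s->right);
        if (s->cost_self <= children) {
            s->left = s->right = NULL; /* cut the sons (freeing them is omitted) */
            return s->cost_self;
        }
        return children;
    }

Calling prune() on the root of T_M in one post-order pass leaves exactly the tree whose leaves minimize the total adaptive code length of the given data.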
7.5 Experimental results

The described modeling was implemented for coding the significance bits of wavelet transform coefficients. This technique was described in Chapter 6. In our experiments, eight neighboring coefficients were used to condition the probability distribution of the bit being coded. The coefficients were quantized to a 2-bit value to form 16-bit contexts, resulting in an FSM model with 2^16 = 65536 conditional states. This kind of context formation allows for conditioning the probability not only on the pattern texture, but also on its energy. We applied this modeling to code the significant bits of the HL_1 and LH_1 bands of the first decomposition level of the wavelet transform. For images of ordinary size the resulting model is far too large. Therefore, the contexts were optimally quantized w.r.t. minimum conditional entropy via dynamic programming, using statistics from the training set.

We tried two different modeling approaches: fixed model and multi-model. In the first experiment, the number of states after context quantization was determined individually for each image by the minimum code length. The results are given in Table 7.1. The second column contains the size of the images. The third column shows the best model size (in terms of compression performance). The number of bytes of the coded data is given in the fourth column.

Table 7.1: The compression performance of the hierarchical model versus the fixed optimal model.

In the second experiment, we used the described approach. First, the model with 256 states was found via dynamic programming. (In both experiments we used the same training data set for clustering by context quantization. Images used for test coding were taken from outside the training set.) Then this model was individually quantized for each image via the tree pruning algorithm described in Section 7.4 and sent to the decoder as a header. The fifth column gives the number of bytes of the coded data together with the tree description information. The figures show that the described approach allows for finding the right model size based on the code length, and in most cases it performs even better. This is due to the fact that the second quantization is performed based on the statistics of the data to be coded. Thus, the described algorithm allows for efficient tuning of the model to the actual data.

7.6 Summary

We proposed a new (multi-)modeling approach based on the technique of optimal context quantization. This method is primarily aimed at data whose underlying source is described by an FSM model. The presented double adaptive coding method allows for efficient use of some pre-knowledge of the source as well as of the individual statistics of the given data. Experiments show that the proposed approach not only yields the optimal model size, it also corrects the context quantization for a better fit to the data. This results in a superior compression performance. On the other hand, this method may require more memory space and time than conventional methods. In some applications this may be an unacceptable trade-off between complexity and performance.
Chapter 8

Compression of medical images

This chapter presents an algorithm for compression of medical images which allows for progressive near-lossless as well as lossless image coding. The method is based on a lossy plus near-lossless layered compression scheme and embedded quantization of the difference signal. We show that, for large tolerance values, this technique allows for a better image quality and compression performance than algorithms based on predictive coding. To achieve high compression performance, we exploit the methods developed in the previous chapters: binary decomposition of the quantization indices and optimization of the context models used for conditional entropy coding of the binary decisions. We also describe a method for image reconstruction with the minimum mean-square error (MSE) criterion. The method is an extension of near-lossless coding and allows the MSE to be reduced while providing a strict bound on the maximum absolute difference error.

8.1 Introduction

Compression of medical images is subject to severe requirements on the quality of the reconstructed data. Ideally, the decoded image should be identical to the original one, meaning lossless compression. However, lossless coding allows for rather modest compression ratios, normally ranging from 1.5 to 3. Lossy coding yields much higher compression ratios by allowing some distortion in the reconstructed image. Near-lossless compression is a lossy image coding approach where the distortion is specified by the maximum allowable absolute difference (the tolerance value) between pixel values of the original and the reconstructed images (in the literature it is also called $L_\infty$-constrained lossy compression [65]). The method was introduced in [10].

At present, lossy algorithms are normally not used in clinical practice in the US due to legal questions and regulatory policies. Physicians and radiologists are concerned with the legal consequences of an incorrect diagnosis based on a lossy compressed image [62]. On the other hand, lossy algorithms allow for much higher compression rates compared to lossless coding while still being able to preserve the diagnostic information. Extensive research is being carried out to develop reasonable policies and acceptable standards for the use of lossy processing on medical images. Apparently, algorithms intended for compression of medical images should be able to
78 72 CHAPTER 8. COMPRESSION OF MEDICAL IMAGES manage and control the introduced distortion in an efficient way. From this point of view, near-lossless compression seems to be the most suitable method for medical imaging. Near-lossless compression is supported by the recent JPEG-LS standard for lossless image coding [60]. Yet, a desirable feature of a compression algorithm is the ability to produce an embedded bit stream, meaning that any initial part of it can be used to reconstruct the image with a fidelity in proportion to the length of this part. The JPEG2000 standard [56] is an example of a sophisticated method, which implements this idea, allowing for progressive quality and spatial resolution coding (up to the lossless coding). However, algorithms that produce an embedded bit stream do not allow for efficient control of distortion in L (near-lossless) sense. A method for progressive near-lossless coding can be derived from a conventional near-lossless algorithm by exploiting an embedded quantization technique. For the time being, the best results have been reported for the algorithms based on predictive coding (see, e.g., [65]). An efficient algorithm implementing progressive near-lossless compression was proposed in [5]. However, methods based on predictive coding turn out to be inefficient, if the tolerance is comparatively large (> 3 for 8-bit images). This is caused by the increasing noise of prediction errors, resulting in an inferior performance of the prediction based on inaccurate data. Another approach is a lossy plus refinement coding (see, e.g., [4]). In this algorithm, the quantized difference (residual) signal is added to the lossy compressed version (the lossy layer) to produce an image, which satisfies the specified near-lossless tolerance. In this chapter we show, that algorithms based on this technique allow for a better compression performance and higher objective and subjective quality of the reconstructed image for tolerance values > 3, than methods based on predictive coding. The reason is that the lossy layer preserves smooth regions well and the near-lossless correction is mainly required in the areas with high edge activity. Figure 8.1 shows the same image (a) compressed by the JPEG-LS standard (b) and the proposed layered method (c) with the tolerance value 13. One can see the specific distortion introduced by JPEG-LS. An interesting feature of a near-lossless coding is the possibility of two-mode reconstruction: given a part of the bit stream, corresponding to some tolerance value, one can reconstruct the image with maximum absolute difference error or minimum mean-square error (MSE) criterion. In the latter case, the maximum absolute difference error is also bounded (and known in advance, which is essential), being, however, somewhat larger than the specified tolerance value. This suggests, that the lossy plus refinement approach is better for the design of an algorithm intended for progressive near-lossless compression than predictive coding algorithms. Another advantage of this technique is that it can be built on top of any existing lossy compression algorithm that allows to use all its functionality at bit rates lower than that needed for near-lossless reconstruction.
79 8.1. INTRODUCTION 73 (a) (b) (c) Figure 8.1. An example of near-losslessly compressed MR image with the tolerance value 13: (a) original image; (b) image produced by the JPEG-LS standard; (c) image produced by the proposing lossy plus refinement algorithm. In this chapter, we extend the idea of lossy plus refinement coding [4] to achieve progressive near-lossless image compression via embedded scalar quantization of the difference signal. We also describe an efficient method for coding refinement information based on binary decomposition and the reconstruction technique minimizing MSE. The compression performance of the proposed algorithm was verified on two widely used types of medical images: computer tomography (CT) and magnetic resonance (MR) images. We should note, that due to the discreteness of the distortion measure (we only allow the tolerance to assume integer values), the method yields a quasi-embedded bit
stream in the sense that the ($L_\infty$) distortion changes step-wise, rather than smoothly, as the number of decoded bits increases. Figure 8.2 shows an example of the rate-distortion function of the proposed method (reconstruction with the tolerance δ_0 = 13 corresponds to the rate r_0, and lossless reconstruction corresponds to the rate r_3).

Figure 8.2: Example of the rate-distortion function of the near-lossless embedded coding defined by a 4-layered refinement with δ_0 = 13, δ_1 = 4, δ_2 = 1 and δ_3 = 0.

8.2 Embedded near-lossless quantization

An embedded quantization of the difference signal between the original and the reconstructed image is an essential part of the algorithm. Its design differs from the conventional approach (see, e.g., [56]) because the values to be quantized are integers. In conventional embedded quantization, the intervals of a lower rate quantizer are partitioned to form the intervals of a higher rate quantizer. Normally, such a quantizer is designed in a forward way, by specifying the quantization intervals of the lowest rate quantizer, the partitioning rule and the quantizer output values [56]. If the values to be quantized are integers, the intervals are represented by a set of consecutive integer values, one of which is used as the output value. This imposes restrictions on the partitioning and on the choice of the output values for the following reasons. First, the finite number of integers in an interval allows only for finitely many refinement layers (the interval can be partitioned only finitely many times), and this number depends on the length of the intervals of the lowest rate quantizer. The highest rate quantizer indices normally correspond to the set of input symbols (no quantization). Secondly, the partition may result in intervals of different lengths, i.e., in non-uniform quantization. Thus, an embedded quantizer of integers should be designed in a backward way, i.e., by merging intervals instead of partitioning, starting from the highest rate quantizer. This process is repeated recursively until the desired lowest rate quantizer is reached. To achieve near-lossless embedding, the merging is performed according to the
standard near-lossless quantization rule. Let v be the pixel value of the original image, let v̂ be the pixel value of the lossy layer, and let d = v − v̂ be the difference signal. Let I_i ∈ Z be the quantization index of the refinement layer (quantizer) i = 0, 1, 2, ..., and let I_0 be the quantization index of the highest rate quantizer (the unquantized difference value, i.e., I_0 = d). Given integer values Δ_i > 0, i = 1, 2, ..., the lower rate quantization index I_i is recursively derived from the higher rate index I_{i−1} by

\[
I_i = \left\lfloor \frac{I_{i-1} + \operatorname{sign}(I_{i-1})\,\Delta_i}{2\Delta_i + 1} \right\rfloor, \qquad i = 1, 2, \ldots, \tag{8.1}
\]

where ⌊·⌋ denotes the integer part of the argument. Expression (8.1) defines successive uniform and symmetric merging (around zero). The indices I_i correspond to the values d near-losslessly quantized with the tolerance δ_i, which is recursively defined by Δ_i:

\[
\delta_i = \begin{cases} 0, & i = 0;\\ 2\delta_{i-1}\Delta_i + \delta_{i-1} + \Delta_i, & i = 1, 2, \ldots; \end{cases} \tag{8.2}
\]

in other words,

\[
I_i = Q(d) = \left\lfloor \frac{d + \operatorname{sign}(d)\,\delta_i}{2\delta_i + 1} \right\rfloor, \tag{8.3}
\]

where Q(d) denotes the quantization operator of the variable d, and δ_i is defined by (8.2). The values Δ_i define the degree of embedding, i.e., the smaller Δ_i, the finer the quantizer. Clearly, the finest quantizer is achieved by setting Δ_i = 1, i = 1, 2, .... The sequential merging of indices can be described by a tree. Figure 8.3 shows such a tree for a 3-layer quantization with Δ_1 = Δ_2 = 1.

Figure 8.3: The near-lossless quantization tree of a 3-layer quantizer defined by Δ_1 = Δ_2 = 1.

According to (8.2), the tolerance value corresponding to the quantizer i depends on that of the quantizer i − 1 and on the parameter Δ_i. Thus, the requirement of embedding imposes restrictions on the sequence of allowable tolerance values. The finest 4-layer embedded quantization defines the following sequence of tolerance values: 0, 1, 4, 13. Yet, in practice, refinement layers with the tolerances, say, 0, 2, 7, 22 (corresponding to the choice Δ_1 = 2, Δ_2 = 1, Δ_3 = 1) may in some cases be more appropriate than those with the tolerances 0, 1, 4, 13.
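A minimal C sketch of the recursions (8.2) and (8.3) is given below; the array layout (the parameters Δ_i are passed in Delta[1..N−1], with Delta[0] unused) and the function names are illustrative assumptions, not the thesis code.

    static int sgn(int x) { return (x > 0) - (x < 0); }

    /* Tolerances of an N-layer embedded quantizer, finest first, eq. (8.2):
     * delta[0] = 0, delta[i] = 2*delta[i-1]*Delta[i] + delta[i-1] + Delta[i].
     * The parameters Delta_i are passed in Delta[1..N-1]; Delta[0] is unused. */
    static void tolerances(const int Delta[], int N, int delta[])
    {
        delta[0] = 0;
        for (int i = 1; i < N; i++)
            delta[i] = 2 * delta[i - 1] * Delta[i] + delta[i - 1] + Delta[i];
    }

    /* Near-lossless quantization index for tolerance delta, eq. (8.3).
     * C integer division truncates toward zero, which realises the
     * symmetric "integer part" of the formula.                            */
    static int quantize(int d, int delta)
    {
        return (d + sgn(d) * delta) / (2 * delta + 1);
    }

For instance, tolerances() with Δ_1 = Δ_2 = Δ_3 = 1 yields the sequence 0, 1, 4, 13, and Δ_1 = 2, Δ_2 = Δ_3 = 1 yields 0, 2, 7, 22, matching the examples above.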
By choosing the parameters Δ_i, one can design quantizers with different degrees of embedding, depending on the application requirements. Such a backward design is necessary for determining the tolerance values that allow for embedded quantization. The progressive coding (and quantization) is performed, however, starting from the lowest rate quantizer. For convenience, we index the quantizers and the corresponding tolerance values starting from the lowest rate quantizer, such that for an N-layered quantization δ_0 corresponds to the largest tolerance value (δ_0 = 13 for a 4-layer quantization defined by Δ_1 = Δ_2 = Δ_3 = 1) and δ_{N−1} = 0. Quantization indices of the first approximation layer (i = 0) are calculated according to the standard formula

\[
I_0 = \left\lfloor \frac{d + \operatorname{sign}(d)\,\delta_0}{2\delta_0 + 1} \right\rfloor, \tag{8.4}
\]

where d = v − v̂. The corresponding reconstruction value is defined by

\[
\hat v_0 = \hat v + I_0(2\delta_0 + 1). \tag{8.5}
\]

Quantization indices of the refinement layers i = 1, 2, ..., N − 1 are calculated as follows:

\[
I_i = \left\lfloor \frac{\hat d_{i-1} + \operatorname{sign}(\hat d_{i-1})\,\delta_i}{2\delta_i + 1} \right\rfloor, \tag{8.6}
\]

where \(\hat d_i = d - \sum_{k=0}^{i} I_k(2\delta_k + 1)\). Since \(|\hat d_i| \leq \delta_i\), the indices I_{i>0} take values in the range [−Δ_i, Δ_i] (unlike the index I_0). A sequence of quantization indices I = I_0 I_1 ... I_j, 0 ≤ j ≤ N − 1, describes the difference value with the precision δ_j and defines a path in the quantization tree, see Figure 8.3. The tree-structured description of the difference value guarantees the optimality of the embedded coding [12] in the $L_\infty$ sense. Roughly speaking, it means that coding a difference value with the precision δ_j via the sequence of indices I_0 I_1 ... I_j takes the same number of bits as if we directly coded the values ⌊(d + sign(d)δ_j)/(2δ_j + 1)⌋.

8.3 Entropy coding of the refinement layers

We describe the entropy coding and context modeling for a four-layered progressive compression. In the following we assume that Δ_i = 1, i = 1, 2, ... (i.e., we deal with the finest embedded quantizer). The entropy coding technique of the first near-lossless refinement layer differs from the coding of the other refinement layers. This layer constitutes the difference signal quantized with the largest tolerance value. The potential range of the quantization indices is defined by the interval

\[
\left[ \frac{-2^g + 1 - \delta_0}{2\delta_0 + 1},\; \frac{2^g - 1 + \delta_0}{2\delta_0 + 1} \right],
\]

where g is the number of bits used for the binary representation of the pixels. For example, for δ_0 = 13 and 12-bit data (which is often the case for medical images), the indices take values in the interval [−152, 152]. For the subsequent layers, the refinement information corresponds to only three symbols {−1, 0, 1} calculated by (8.6).
Figure 8.4: The binary decomposition tree of a ternary alphabet.

The probability mass function of the difference values can be approximated by the generalized two-sided geometric distribution (GTSGD) introduced in Chapter 4. Thus, for efficient encoding we use the technique based on binary decomposition described there. The coding of the first refinement layer is performed according to the decomposition tree B shown in Figure 4.3. The reason for this choice is that, even though this decomposition is slightly worse than decomposition A for low-entropy sources, it allows for explicit context modeling of the sign decision.

In the subsequent layers, the quantization indices take values from a ternary alphabet¹: I_{1,2,3} ∈ {−1, 0, 1}. To make the coder compatible with binary arithmetic coding, we also perform a binary decomposition of this alphabet. The decomposition tree is shown in Figure 8.4, where µ^i_1 and µ^i_2 denote the decision nodes (the superscript i indicates the corresponding refinement layer). By a_0, a_1 and a_2 we denote symbols from the set {−1, 0, 1}. The specific assignment is made based on the difference value quantized with the precision of the previous layer, i.e., based on the value

\[
I = \left\lfloor \frac{(x - \hat x) + \operatorname{sign}(x - \hat x)\,\delta_{i-1}}{2\delta_{i-1} + 1} \right\rfloor.
\]

To reduce the number of decisions and the coding time, the symbol assigned to a_0 should have the largest probability (clearly, the particular assignment of the two less probable symbols to a_1 and a_2 does not matter). Thus, if I = 0, the most probable symbol is 0 and the assignment is {a_0 = 0, a_1 = 1, a_2 = −1}. If I > 0, the most probable symbol is 1 and the corresponding assignment is {a_0 = 1, a_1 = 0, a_2 = −1}. Otherwise, the assignment is {a_0 = −1, a_1 = 0, a_2 = 1}. The following C code implements index encoding by the decomposition tree of Figure 8.4:

    if (I_i == a_0)                 /* most probable symbol: a single binary decision */
        encode(0, µ^i_1);
    else {
        encode(1, µ^i_1);
        if (I_i == a_1)             /* separate the two remaining symbols */
            encode(0, µ^i_2);
        else
            encode(1, µ^i_2);
    }

The encoding of the decisions of a given refinement layer is conditioned on the decisions made at the same spatial position in the previous layers. Thus, the sequence of quantization indices I_0 I_1 ... is coded as a Markov chain, which ensures the optimality of the successive refinement of information [12].

¹ Note that in the general case, for the i-th refinement layer, the symbols would be taken from the set {−Δ_i, −Δ_i + 1, ..., 0, ..., Δ_i − 1, Δ_i}, see Section 8.2.
Moreover, the decisions reveal correlation with the decisions made in the neighboring positions. Thus, the coding of the binary decisions at each node η (µ^i_{1,2}) is performed with the probability distribution conditioned on the decomposition tree node and the context:

\[
p(0) = p\bigl(0 \mid \eta\,(\mu^i_{1,2}), s\bigr). \tag{8.7}
\]

The set of contexts {s} = S defines a statistical model M for the decision at the node η (µ_{1,2}). Context models should be designed individually for each node of the decomposition tree and for each layer, since the statistics may vary and may correlate differently with the context information. However, most of the decisions are made in the tree whose root corresponds to the quantization index I_0 = 0 of the first refinement layer, see Figure 8.3. We shall refer to it as T_0. The contribution of the other trees to the resulting code length is negligible, and context modeling does not give any noticeable improvement of the compression performance there. Moreover, the lack of statistics at those nodes reduces the coding efficiency due to the high model cost. From these considerations, we designed models only for the nodes of the tree T_0.

Context models were designed using the texture + energy principle described in Section 5.2. To form the contexts, we use two kinds of context information: the quantized residual signal \(\hat r = \sum_{k=0}^{i} I_k(2\delta_k + 1)\) (where i is the current refinement layer) and the reconstructed image (for the encoder, this is the sequentially updated lossy layer). The residual signal is used to extract the texture information, and the reconstructed image is exploited to calculate the local energy². Let T and E denote the texture and energy contexts, respectively. The texture contexts are defined individually for the different nodes of the refinement tree, whereas the energy context is defined in the same way for all the nodes.

The energy context is calculated via linear filtering followed by quantization (to reduce the number of contexts). The unquantized value is calculated as

\[
E' = 2\hat v[y, x] - \hat v[y-1, x] - \hat v[y, x-1] - \hat v[y+1, x] - \hat v[y, x+1], \tag{8.8}
\]

where y and x define the current spatial position of the index being coded (the row and the column, respectively). Equation (8.8) defines a simple high-pass filter. It gives an estimate of the convexity/concavity of the image in the horizontal and vertical directions at the location of the symbol being coded. The quantization rule is defined as³

\[
E = \begin{cases} 0, & \text{if } 0 \leq |E'| < 2;\\ \operatorname{sign}(E')\,\lfloor \log_2 |E'| \rfloor, & \text{if } 2 \leq |E'| < 128;\\ \operatorname{sign}(E')\cdot 7, & \text{if } |E'| \geq 128. \end{cases} \tag{8.9}
\]

² The energy can be estimated from the residual signal as well. Yet, in our experiments the use of the reconstructed image resulted in a better compression performance.
³ The quantization rule can be optimized via dynamic programming [64]. However, we found that the rule (8.9) yields basically the same compression performance as the optimal quantizer.
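A short C sketch of the context primitives defined by (8.9)-(8.11) follows; the function names are ours, and the logarithmic term of (8.9) is read here as ⌊log_2|E'|⌋.

    #include <stdlib.h>
    #include <math.h>

    /* Quantized energy context, eq. (8.9); the result lies in {-7, ..., 7}. */
    static int quantize_energy(int Eprime)
    {
        int mag = abs(Eprime);
        int s   = (Eprime > 0) - (Eprime < 0);
        if (mag < 2)   return 0;
        if (mag < 128) return s * (int)log2((double)mag);  /* floor of log2 for mag >= 2 */
        return s * 7;
    }

    /* Sign and significance of the quantized residual, eqs. (8.10)-(8.11). */
    static int sign_value(int r)         { return r == 0 ? 0 : (r > 0 ? 1 : 2); }
    static int significance_value(int r) { return r != 0; }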
To form the texture contexts, we use the sign value defined as

\[
\xi[y, x] = \begin{cases} 0, & \text{if } \hat r[y, x] = 0;\\ 1, & \text{if } \hat r[y, x] > 0;\\ 2, & \text{if } \hat r[y, x] < 0; \end{cases} \tag{8.10}
\]

and the significance value

\[
\sigma[y, x] = \begin{cases} 0, & \text{if } \hat r[y, x] = 0;\\ 1, & \text{if } \hat r[y, x] \neq 0. \end{cases} \tag{8.11}
\]

For the first refinement layer we designed two types of contexts, one for the zero/non-zero decision (node η_0, see Figure 4.3) and another one for the sign decision (node η_1). No modeling is performed for the nodes η_{2,3}. The texture context for the node η_0 is formed as the binary vector

\[
T(\eta_0) = b_0 b_1 b_2 b_3 b_4 b_5 b_6 b_7, \tag{8.12}
\]

where

\[
\begin{aligned}
b_0 &= \sigma[y, x-1];\\
b_1 &= \sigma[y-1, x-1];\\
b_2 &= \sigma[y-1, x];\\
b_3 &= \sigma[y-1, x+1];\\
b_4 &= \sigma[y, x-2] \vee \sigma[y, x-3] \vee \sigma[y-1, x-2] \vee \sigma[y-1, x-3];\\
b_5 &= \sigma[y-2, x-2] \vee \sigma[y-2, x-3] \vee \sigma[y-3, x-2] \vee \sigma[y-3, x-3];\\
b_6 &= \sigma[y-2, x-1] \vee \sigma[y-2, x] \vee \sigma[y-3, x-1] \vee \sigma[y-3, x];\\
b_7 &= \sigma[y-2, x+1] \vee \sigma[y-2, x+2] \vee \sigma[y-3, x+1] \vee \sigma[y-3, x+2]
\end{aligned} \tag{8.13}
\]

are the binary values (∨ is the binary OR operation). The template for this texture context includes 20 neighboring significance values. The calculation of the values b_4 ... b_7 can be viewed as a pre-quantization of a 20-bit context into an 8-bit one. The texture context for the sign decision uses the sign values of the two causal nearest neighbors. It is calculated as

\[
T(\eta_1) = 3\xi[y, x-1] + \xi[y-1, x]. \tag{8.14}
\]

The texture models for the second layer were also designed for only two decisions: zero/non-zero (µ^1_1) and sign (µ^1_2). For the node µ^1_1, the context is formed as the vector (8.12), where the binary values are defined by the significance values of the eight adjacent neighbors. For the node µ^1_2, the texture context uses the sign values of six neighbors: ξ[y, x−1], ξ[y−1, x−1], ξ[y−1, x], ξ[y−1, x+1], ξ[y, x+1] and ξ[y+1, x]. This results in 3^6 = 729 contexts. To reduce the number of contexts, we use an assumption of symmetry of the conditional distributions:

\[
P(+ \mid s) = P(- \mid \bar s); \qquad P(- \mid s) = P(+ \mid \bar s), \tag{8.15}
\]
where s̄ denotes the context in which all the signs are inverted. This reduces the number of contexts by one half.

The third and the fourth layers describe essentially the noise signal. We designed simpler models for these layers. The texture context for the decision at the node µ^2_1 is formed as a binary vector of the significance values of four neighbors, T(µ^2_1) = b_0 b_1 b_2 b_3, where

\[
b_0 = \sigma[y, x-1];\quad b_1 = \sigma[y-1, x-1];\quad b_2 = \sigma[y-1, x];\quad b_3 = \sigma[y-1, x+1]. \tag{8.16}
\]

The texture context for the sign decision (the node µ^2_2 if I_1 = 0) uses the sign values of the same neighbors. There is no context modeling for the nodes µ^2_2 if I_1 ≠ 0. No texture context is defined for the decision at the nodes µ^3_1 of the last refinement layer; it uses only the energy context for conditional entropy coding. The texture context for the sign decision (the node µ^3_2, if I_1 = I_2 = 0) is formed similarly to that of the third layer. The resulting models are formed as a Cartesian product of the texture and the energy contexts: s = T × E. To reduce the model cost and avoid context dilution, the high-order models of the first two layers were optimized using the MCL techniques described in Chapter 5.

8.4 Experimental results

The compression performance of the proposed technique was verified on 10 sets of frequently used types of medical images: 5 sets of computed tomography (CT) and 5 sets of magnetic resonance (MR) images. They were taken from different sources: 3D-Lab⁴, RSNA96⁵ and Philips test images⁶. Each set was made up of images acquired by the same device and with similar content (mostly subsets of volumetric data, although not necessarily). Specifications of the image sets are listed in Table 8.1.

We used the JPEG2000 image compression standard (Kakadu software implementation⁷) to produce the lossy layer. Of course, the compression performance of the proposed method depends on the quality of the lossy compressed image. For each image there is an optimal number of bits needed for coding the lossy layer, which gives the best overall compression performance [4]. We tried two ways of specifying the lossy layer quality: an individually optimal lossy layer rate for each image, and the optimal rate for the set. We will show that for images of the same type the difference in compression performance is negligible. It

⁴ (images are not publicly available)
⁵ ftp://ftp.erl.wustl.edu/pub/dicom/images/version3/rsna96/
⁶ ftp://ftp-wjq.philips.com/medical/interoperability/out/medical Images/CTaura CTsecura/ (images s s2 45)
⁷ Executables are available at
87 8.4. EXPERIMENTAL RESULTS 81 Set Number Bits Size Source of images per pixel CT D-Lab CT D-Lab CT RSNA96/algotec24 CT RSNA96/picker3 CT PHILIPS test images MR D-Lab MR D-Lab MR RSNA96/ge20 MR RSNA96/picker4 MR RSNA96/picker6 Table 8.1: Specifications of the sets of the test CT and MR images. means that the compression algorithm is quite robust to the quality of the lossy layer. From a practical point of view, it means that one can specify the same rate for the lossy layer for all images of a particular device or/and modality without sacrificing the compression performance. The results showing average compression rates (in bits per pixel) and peak signal-tonoise ratio (PSNR) for the sets and different tolerance values are listed in Tables The rates include bits spent for the lossy layer. For comparison purposes we use the recent JPEG-LS standard for lossless image compression 8, which also allows for nearlossless coding. The New opt. column shows results obtained by setting optimal lossy layer rate individually for each image in a set. Bit rates and PSNR s for images across the sets are shown as a plot in Appendices A and B, respectively. One can see that the proposed method consistently outperforms JPEG-LS for large tolerance values (δ = 13 and δ = 4). In the lossless mode, their performance is similar. Tables show the compression performance in the lossless mode compared to the JPEG-LS and JPEG2000 standards. Remark 1: According to Tables , 8.8, performance of the proposed algorithm is much worse than the performance of JPEG-LS for the set CT3 (by about 9% in the lossless mode, see Table 8.8). One reason for this can be a feature of images in the set: the real image takes only a circle in the square matrix of pixels (the pixels outside the circle have zero values). Thus, the compression algorithms should handle the unused pixels as if it is a part of the image. JPEG-LS and JPEG2000 have a special mode for efficient coding of such flat regions in an image (the so called run mode ). The proposed algorithm does not have it. Remark 2: An average performance of the proposed algorithm on the set of MR images is slightly worse compared to the performance on CT images. It can be due to a higher noise level in the MR images. The proposed algorithm, apparently, does not perform well on too noisy images for small tolerance values compared with prediction 8 Executables are available at We should note, that this implementation does not use arithmetic coding, instead, it uses Golobm coding.
88 82 CHAPTER 8. COMPRESSION OF MEDICAL IMAGES δ = 13 rate PSNR Set JPEG-LS New New opt. JPEG-LS New New opt Average Table 8.2: Average bit rates and PSNR for the sets of CT images for δ = 13. δ = 4 rate PSNR Set JPEG-LS New New opt. JPEG-LS New New opt Average Table 8.3: Average bit rates and PSNR for the sets of CT images for δ = 4. based algorithms. The compression performance of the algorithm depends on the set of images used for optimization of context models for the first and the second refinement layers. The results given in the tables were obtained by using a set of 60 ordinary 8-bit natural (non medical) images found on the internet in different databases of test images. We also tried to optimize the models based on statistics of images within a set by dividing it into two subsets: one for the model optimization (60 randomly chosen images) and another one for test compression. The results for the three sets (two CT and one MR set) are given in Table The reader can see that there is not much gain in using specially selected training data in context model optimization process. Our conclusion is that statistical properties of natural and medical images are similar at this level. 8.5 Reconstruction with minimum MSE criterion An interesting property of a lossy plus near-lossless technique is the ability to reconstruct the image with a lower MSE, than in the normal (near-lossless) mode, while providing a strict bound on the maximum absolute difference value, although being somewhat larger. Since this property is applicable to any lossy plus near-lossless approach regardless
whether it is embedded or not, we consider a one-layer refinement scheme corresponding to some tolerance value δ (one may as well apply this reconstruction technique to the algorithm described in [4]).

Table 8.4: Average bit rates and PSNR for the sets of CT images for δ = 1.

Table 8.5: Average bit rates and PSNR for the sets of MR images for δ = 13.

Let I be the quantized difference signal, i.e.,

\[
I = Q(d) = \left\lfloor \frac{d + \operatorname{sign}(d)\,\delta}{2\delta + 1} \right\rfloor. \tag{8.17}
\]

Normally, the reconstructed difference value is calculated as

\[
\hat d_I = I(2\delta + 1), \tag{8.18}
\]

and the reconstructed pixel value is defined as

\[
\hat v_I = \hat v + \hat d_I \tag{8.19}
\]

(v̂ is the pixel value of the lossy layer). The rule (8.18) defines the reconstructed value as the midpoint of the quantization interval. It provides a symmetric bound on the absolute difference errors, defined by the tolerance value, and minimizes the absolute difference distortion. However, this reconstruction value may not be optimal for obtaining the minimum MSE. Observation of the error distribution after correcting the pixel values of the lossy layer to satisfy the required tolerance suggests that the reconstruction value satisfies both the
minimum $L_\infty$ and $L_2$ criteria only if the difference signal falls in the central quantization interval, see Figure 8.5(a)⁹. In other cases, the optimal reconstruction values for these two distortion measures are different. Figure 8.5(b) shows the distribution of the errors if the difference signal falls in one of the positive quantization intervals. (For the negative intervals the distribution is similar but flipped around zero.)

Table 8.6: Average bit rates and PSNR for the sets of MR images for δ = 4.

Table 8.7: Average bit rates and PSNR for the sets of MR images for δ = 1.

For the minimum MSE criterion, an optimal reconstruction value is defined by the centroid of the quantization interval [18]:

\[
\tilde d_I = E\{\, d \mid Q(d) = I \,\}. \tag{8.20}
\]

The reconstruction rule can be written as

\[
\tilde d_I = \hat d_I + D_I = I(2\delta + 1) + D_I, \tag{8.21}
\]

where

\[
D_I = \bigl[\, E\{\, (d - \hat d_I) \mid Q(d) = I \,\} \,\bigr], \tag{8.22}
\]

and where [·] denotes the rounding operation. Rounding is needed because we allow only integer reconstruction values. In general, the optimal reconstruction levels \(\tilde d_I\) (or D_I) should be found individually for each quantization interval. However, the distribution of the difference signal can

⁹ One can also see from this picture that the distribution of the difference signal is similar to the two-sided geometric distribution.
be approximated by the two-sided geometric distribution centered at zero. It is easy to verify that in this case D_{I=0} = 0 and D_{I>0} = −D_{I<0} = −D. Thus, the reconstruction rule becomes

\[
\tilde d_I = \begin{cases} 0, & I = 0,\\ I(2\delta + 1) - \operatorname{sign}(I)\,D, & I \neq 0, \end{cases} \tag{8.23}
\]

where D is defined by

\[
D = \bigl[\, E\{\, \operatorname{sign}(\hat d_I)(\hat d_I - d) \mid I \neq 0 \,\} \,\bigr]. \tag{8.24}
\]

Reconstruction according to (8.23) increases the tolerance by D: δ′ = δ + D. However, since D is known in advance (at least it is bounded by δ, i.e., δ′ ≤ 2δ), the reconstruction with the minimum MSE criterion still allows a bound to be kept on the maximum absolute difference value, which is now δ′. The optimal D can be found in the off-line algorithm design based on the training set, or it can be estimated during encoding and sent to the decoder as side information. One can also use a set of context-dependent values D. In our experiments, implementation of the minimum MSE reconstruction technique for the sets of test images resulted in a decrease of the MSE by about % and an increase of the PSNR by about dB, depending on the image set. The resulting PSNR and MSE for the sets of CT and MR images are listed in Tables 8.11 and 8.12. The values D resulting in the least MSE were found individually for each set; they can be calculated from the values δ′ given in the tables.

Table 8.8: Average bit rates for the sets of CT images in the lossless mode.

Table 8.9: Average bit rates for the sets of MR images in the lossless mode.
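The decoder-side reconstruction of (8.18) and (8.23) can be sketched in C as follows; the function name and the way D is obtained (off-line from a training set, or from side information) are assumptions of this illustration.

    static int sgn(int x) { return (x > 0) - (x < 0); }

    /* Reconstruction of the difference value from the quantization index I.
     *   delta - tolerance of the decoded layer
     *   D     - centroid offset: D = 0 gives the midpoint rule (8.18),
     *           D > 0 gives the minimum-MSE rule (8.23)                     */
    static int reconstruct_difference(int I, int delta, int D)
    {
        if (I == 0)
            return 0;
        return I * (2 * delta + 1) - sgn(I) * D;
    }

Setting D = 0 recovers the midpoint rule with maximum error δ; a positive D trades a larger worst-case error δ + D for a lower MSE when the residual distribution is peaked at zero, as in Figure 8.5.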
92 86 CHAPTER 8. COMPRESSION OF MEDICAL IMAGES δ = 13 δ = 4 δ = 1 δ = 0 Set Opt. Opt. Opt. Opt. CT CT MR Table Average bit rates for selected sets of CT and MR images obtained by optimizing the context models for the sets. Normal Minimum MSE Set δ PSNR MSE δ PSNR MSE CT db CT db CT db CT db CT db Table Average PSNR and MSE for normal and minimum MSE reconstruction criterion for the sets of CT images. 8.6 Summary We proposed an image compression technique, which allows for sequentially refinable near-lossless coding. The technique is based on lossy plus refinement coding and embedded quantization of the difference signal. For efficient coding of quantization indices we use binarization of the source symbols combined with binary arithmetic coding. This method allows for efficient statistical modeling for conditional entropy coding of binary decisions avoiding context dilution problem. The proposed technique can be easily extended for coding 3-D images. We should note, however, that due to the discreteness of the distortion measure (the tolerance is defined by an integer value), the full embedding can not be achieved for this kind of algorithms. Moreover, optimal compression imposes additional constraints on the sequence of allowable tolerance values. Despite that fact, such (partial) embedding with the possibility of robust control of the distortion is an attractive alternative to traditional embedded coding techniques. The compression performance of the proposed method was tested on sets of CT and MR medical images. We show that the method yields better compression performance for tolerance values larger than 3 compared to the JPEG-LS standard. Yet, in the lossless mode the performance is slightly inferior in general. Another advantage of the proposed technique is the possibility of reconstruction according to the minimum MSE criterion, which still allows for a strict control of the maximum absolute difference distortion. This makes the proposed algorithm a good candidate for compression of medical images.
Table 8.12: Average PSNR and MSE for the normal and the minimum MSE reconstruction criteria for the sets of MR images.
Figure 8.5: Examples of probability distributions of the errors after near-lossless refinement of the lossy layer (the figures correspond to near-lossless coding with δ = 13) for two cases: (a) the difference value falls in the zero quantization interval; (b) the difference value falls in one of the positive quantization intervals. The distributions are normalized by the probability of the interval.
95 Bibliography [1] J. Åberg, A universal source coding perspective on PPM, Ph.D. dissertation, Lund University, [2] N. M. Abramson, Information theory and coding. New-York: McGraw-Hill, [3] M. D. Adams and F. Kossentini, Reversible integer-to-integer wavelet transforms for image compression: Performance evaluation and analysis, IEEE Trans. Image Processing, vol. 9, no. 6, pp , [4] R. Ansari, N. Memon, and E. Ceran, Near-lossless image compression techniques, J. Electronic Imaging, vol. 7, no. 3, pp , [5] I. Avcibas, B. Sankur, N. Memon, and K. Sayood, A progressive lossless/nearlossless image compression algorithm, IEEE Sign. Proc. Letters, vol. 9, no. 10, pp , [6] A. Barron, J. Rissanen, and B. Yu, The minimum description length principle in coding and modeling, IEEE Trans. Inform. Theory, vol. 44, no. 6, pp , [7] T. Berger, Rate distrotion theory. Englewood Cliffs, NJ: Prentice-Hall, [8] A. R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, Lossless image compression using integer to integer wavelet transforms, in Proc. Int. Conf. Image Processing, vol. 1, Oct. 1997, pp [9], Wavelet transforms that map integers to integers, Applied and Computational Harmonic Analysis, vol. 5, no. 3, pp , [10] K. Chen and T. V. Ramabadran, Near lossless compression of medical images though entropy coded DPCM, IEEE Trans. Med. Imag., vol. 13, no. 3, pp , [11] L. Davisson, Universal noiseless coding, IEEE Trans. Inform. Theory, vol. 19, no. 6, pp , [12] W. H. R. Equitz and T. M. Cover, Successive refinement of information, IEEE Trans. Inform. Theory, vol. 37, no. 2, pp ,
96 90 BIBLIOGRAPHY [13] M. Feder and N. Merhav, Hierarhical universal coding, IEEE Trans. Inform. Theory, vol. 42, no. 5, pp , [14] W. Feller, An introduction to probability theory and its applications. New-York: Wiley, 1968, vol. 1. [15] S. Forchhammer, X. Wu, and J. D. Andersen, Lossless image data sequence compression using optimal context quantization, in Proc. Data Compression Conf., 2001, pp [16] R. G. Gallager, Information theory and reliable communication. NJ: John Wiley & Sons, [17], Variations on a theme by Huffman, IEEE Trans. Inform. Theory, vol. 24, no. 6, pp , [18] A. Gersho and R. M. Gray, Vector quantization and signal compression. Kluwer Academic Publishers, [19] S. Golomb, Run-length encodings, IEEE Trans. Inform. Theory, vol. 12, no. 3, pp , [20] A. Grossman and J. Morlet, Decomposition of Hardy functions into square integrable wavelets of constant shape, SIAM J. Math. Anal., vol. 15, pp , [21] P. G. Howard and J. S. Vitter, Arithmetic coding for data compression, Proc. IEEE, vol. 82, no. 6, pp , [22] F. Jelinec, Probabilistic information theory. New-York: McGraw-Hill, [23] R. E. Krichevsky and V. K. Trofimov, The performance of universal encoding, IEEE Trans. Inform. Theory, vol. 27, no. 2, pp , [24] G. G. Langdon and J. J. Rissanen, A simple general binary source coding, IEEE Trans. Inform. Theory, vol. 28, no. 5, pp , [25] D. LeGall, MPEG: a video compression standard for multimedia applications, Communications of the ACM, vol. 34, no. 4, pp , [26] S. Mallat, A theory for multiresolution signal decomposition: The wavelet representation, IEEE Trans. Pattern Anal. Mach. Int., vol. 11, no. 7, pp , [27] B. Martins and S. Forchhammer, Tree coding of bilevel images, IEEE Trans. Image Processing, vol. 7, no. 4, pp , [28] N. Merhav, G. Seroussi, and M. J. Weinberger, Optimal prefix codes for twosided geometric distributions, IEEE Trans. Inform. Theory, vol. 46, no. 1, pp , 2000.
97 BIBLIOGRAPHY 91 [29] A. Moffat, R. M. Neal, and I. H. Witten, Arithmetic coding revisited, ACM Transactions on Information Systems (TOIS), vol. 16, no. 3, pp , [30] J. Morlet, Wave propagation and sampling theory and, Geophisics, vol. 47, pp , [31] J. O Neal, Predictive quantizing differencial pulse code modulation for the transmission of television signals, Bell Syst. Tech. J., vol. 45, no. 1966, pp [32] R. C. Pasco, Source coding algorithms for fast data compression, Ph.D. dissertation, Stanford University, [33] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, [34] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon, and R. B. Arps, An overview of the basic principles of the Q-coder adaptive binary arithmetic coder, IBM J. Res. & Devel., vol. 32, no. 6, pp , [35] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, advantages, applications. Boston: Academic Press, [36], The Transform and Data Compression Handbook. Boca Raton: CRC Press, [37] R. C. Reininger and J. D. Gibson, Distributions of the two-dimentional DCT coefficients for images, IEEE Trans. Commun., vol. 31, no. 6, pp , [38] R. F. Rice, Some practical universal noisless coding technique, JPL Publication 79-22, March [39], Lossless coding standards for space data systems, in Conference Record of the Thirtieth Asilomar conference on Signals, Systems and Computers, vol. 1, 1997, pp [40] J. Rissanen, Universal coding, information, prediction and estimation, IEEE Trans. Inform. Theory, vol. 30, no. 4, pp , [41], Stochastic complexity in statistical inquiry. Singapore: World Scientific Press, [42] J. Rissanen and G. G. Langdon, Universal modeling and coding, IEEE Trans. Inform. Theory, vol. 27, no. 1, pp , [43] J. J. Rissanen, Generalized Craft inequality and arithmetic coding, IBM J. Res. & Devel., vol. 20, no. 3, pp , [44], A universal data compression system, IEEE Trans. Inform. Theory, vol. 29, no. 5, pp , 1983.
[45] B. Y. Ryabko, Twice-universal coding, Problems of Information Transmission, vol. 20, no. 3, pp ,
[46] B. Y. Ryabko and A. N. Fionov, Efficient method of adaptive arithmetic coding for sources with large alphabet, Problems of Information Transmission, vol. 35, no. 4, pp ,
[47] A. Said and W. Pearlman, A new, fast and efficient image codec based on set partitioning in hierarchical trees, IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp ,
[48] A. Said and W. A. Pearlman, An image multiresolution representation for lossless and lossy compression, IEEE Trans. Image Processing, vol. 5, no. 9, pp ,
[49] R. Schäfer and T. Sikora, Digital video coding standards and their role in video communications, Proc. IEEE, vol. 83, no. 6, pp ,
[50] C. E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., vol. 27, nos. 3-4, pp ,
[51] J. M. Shapiro, Embedded image coding using zerotrees of wavelet coefficients, IEEE Trans. Signal Proc., vol. 41, no. 12, pp ,
[52] Y. M. Shtarkov, T. J. Tjalkens, and F. M. J. Willems, The context-tree weighting method: Basic properties, IEEE Trans. Inform. Theory, vol. 41, no. 3, pp ,
[53] Y. M. Shtarkov, Coding of messages of finite length at the output of a source with unknown statistics, in Proc. 5th Conference on Coding Theory and Information Transmission, Moscow, 1972, pp , (in Russian).
[54] Y. M. Shtarkov, Universal sequential coding of single messages, Probl. Inform. Transmission, vol. 23, no. 3, pp. 3-17,
[55] D. S. Taubman, High performance scalable image compression with EBCOT, IEEE Trans. Image Processing, vol. 9, no. 7, pp ,
[56] D. S. Taubman and M. W. Marcellin, JPEG2000: Image compression fundamentals, standards and practice. Norwell, Massachusetts: Kluwer Academic Publishers,
[57] M. J. Weinberger, J. J. Rissanen, and R. B. Arps, Applications of universal context modeling to lossless compression of gray-scale images, IEEE Trans. Image Processing, vol. 5, no. 4, pp ,
[58] M. J. Weinberger and J. Rissanen, A universal finite memory source, IEEE Trans. Inform. Theory, vol. 41, no. 3, pp , 1995.
[59] M. J. Weinberger, G. Seroussi, and G. Sapiro, LOCO-I: A low complexity, context-based, lossless image compression algorithm, in Proc. Data Compression Conf., 1996, pp
[60] M. J. Weinberger, G. Seroussi, and G. Sapiro, The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS, IEEE Trans. Image Processing, vol. 9, no. 8, pp ,
[61] I. H. Witten, R. M. Neal, and J. G. Cleary, Arithmetic coding for data compression, Communications of the ACM, vol. 30, no. 6, pp ,
[62] S. Wong, L. Zaremba, D. Gooden, and H. Huang, Radiologic image compression – a review, Proceedings of the IEEE, vol. 83, no. 2, pp ,
[63] X. Wu, An algorithmic study on lossless image compression, in Proc. Data Compression Conf., 1996, pp
[64] X. Wu, Compression of wavelet transform coefficients, in The Transform and Data Compression Handbook, K. R. Rao and P. C. Yip, Eds. Boca Raton: CRC Press, 2001, ch. 8, pp
[65] X. Wu and P. Bao, L∞-constrained high-fidelity image compression via adaptive context modeling, IEEE Trans. Image Processing, vol. 9, no. 4, pp ,
[66] X. Wu, P. A. Chou, and S. Forchhammer, Minimum conditional entropy context quantization, submitted to IEEE Trans. Inform. Theory.
[67] X. Wu, P. A. Chou, and X. Xue, Minimum conditional entropy context quantization, in Proc. IEEE Int. Symp. on Inf. Theory, 2000, p. 43.
[68] X. Wu and N. Memon, Context-based, adaptive, lossless image codec, IEEE Trans. Commun., vol. 45, no. 4, pp ,
[69] A. Zandi, J. D. Allen, E. L. Schwartz, and M. Boliek, CREW: Compression with reversible embedded wavelets, in Proc. Data Compression Conf., 1995, pp
Appendix A
Bit rates for test sets of medical images

Note: The image number in a set may not correspond to the original number in the volume; moreover, some sets may contain images from different volumes or scenes. Nevertheless, all images in a set are of the same size and modality and were produced by the same device.
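The rates reported below are given in bits per pixel (bpp), i.e., the size of the compressed representation divided by the number of pixels in the image. The following short sketch only illustrates this bookkeeping; the function name and arguments are illustrative and are not part of the thesis software.

    def bits_per_pixel(compressed_bytes: int, width: int, height: int) -> float:
        # Bit rate in bpp: total number of compressed bits over the number of pixels.
        return 8.0 * compressed_bytes / (width * height)

    # Example: a 512 x 512 CT slice compressed to 40960 bytes gives 1.25 bpp.
    rate = bits_per_pixel(40960, 512, 512)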
[Plots not reproduced: each figure shows the rate (bpp) versus the image number in the set for the proposed coder (New) and JPEG-LS.]
Figure A.1: Bit rates for images of the set CT1 for δ = 13.
Figure A.2: Bit rates for images of the set CT1 for δ = 4.
Figure A.3: Bit rates for images of the set CT1 for δ = 1.
Figure A.4: Bit rates for images of the set CT1 for δ = 0.
Figure A.5: Bit rates for images of the set CT2 for δ = 13.
Figure A.6: Bit rates for images of the set CT2 for δ = 4.
Figure A.7: Bit rates for images of the set CT2 for δ = 1.
Figure A.8: Bit rates for images of the set CT2 for δ = 0.
Figure A.9: Bit rates for images of the set CT3 for δ = 13.
Figure A.10: Bit rates for images of the set CT3 for δ = 4.
Figure A.11: Bit rates for images of the set CT3 for δ = 1.
Figure A.12: Bit rates for images of the set CT3 for δ = 0.
Figure A.13: Bit rates for images of the set CT4 for δ = 13.
Figure A.14: Bit rates for images of the set CT4 for δ = 4.
Figure A.15: Bit rates for images of the set CT4 for δ = 1.
Figure A.16: Bit rates for images of the set CT4 for δ = 0.
Figure A.17: Bit rates for images of the set CT5 for δ = 13.
Figure A.18: Bit rates for images of the set CT5 for δ = 4.
Figure A.19: Bit rates for images of the set CT5 for δ = 1.
Figure A.20: Bit rates for images of the set CT5 for δ = 0.
Figure A.21: Bit rates for images of the set MR1 for δ = 13.
Figure A.22: Bit rates for images of the set MR1 for δ = 4.
Figure A.23: Bit rates for images of the set MR1 for δ = 1.
Figure A.24: Bit rates for images of the set MR1 for δ = 0.
Figure A.25: Bit rates for images of the set MR2 for δ = 13.
Figure A.26: Bit rates for images of the set MR2 for δ = 4.
Figure A.27: Bit rates for images of the set MR2 for δ = 1.
Figure A.28: Bit rates for images of the set MR2 for δ = 0.
Figure A.29: Bit rates for images of the set MR3 for δ = 13.
Figure A.30: Bit rates for images of the set MR3 for δ = 4.
Figure A.31: Bit rates for images of the set MR3 for δ = 1.
Figure A.32: Bit rates for images of the set MR3 for δ = 0.
Figure A.33: Bit rates for images of the set MR4 for δ = 13.
Figure A.34: Bit rates for images of the set MR4 for δ = 4.
Figure A.35: Bit rates for images of the set MR4 for δ = 1.
Figure A.36: Bit rates for images of the set MR4 for δ = 0.
Figure A.37: Bit rates for images of the set MR5 for δ = 13.
Figure A.38: Bit rates for images of the set MR5 for δ = 4.
Figure A.39: Bit rates for images of the set MR5 for δ = 1.
Figure A.40: Bit rates for images of the set MR5 for δ = 0.
Appendix B
PSNR for test sets of medical images

Note 1: The image number in a set may not correspond to the original number in the volume; moreover, some sets may contain images from different volumes or scenes. Nevertheless, all images in a set are of the same size and modality and were produced by the same device.
Note 2: Images in a set may use a number of bits per pixel different from the specification in Table 8.2 (fewer bits per pixel). This may cause drastic changes in PSNR from image to image, since PSNR is calculated from the actual pixel precision.
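As a reminder of the convention in Note 2, PSNR is computed with the peak value determined by the actual pixel precision b of the image, i.e., peak = 2^b - 1, rather than a fixed nominal precision. The sketch below is an illustrative reconstruction of that computation, not the code used to produce the figures.

    import numpy as np

    def psnr(original: np.ndarray, reconstructed: np.ndarray, bits: int) -> float:
        # PSNR in dB, with the peak taken from the actual pixel precision:
        # peak = 2**bits - 1.
        peak = (1 << bits) - 1
        diff = original.astype(np.float64) - reconstructed.astype(np.float64)
        mse = np.mean(diff ** 2)
        if mse == 0.0:
            return float("inf")  # exact (lossless) reconstruction
        return 10.0 * np.log10(peak ** 2 / mse)

With this convention, an image that uses fewer significant bits than the rest of its set has a smaller peak term, which shifts its PSNR even when the reconstruction error is comparable.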
[Plots not reproduced: each figure shows the PSNR versus the image number in the set for the proposed coder (New) and JPEG-LS.]
Figure B.1: PSNR for images of the set CT1 for δ = 13.
Figure B.2: PSNR for images of the set CT2 for δ = 13.
Figure B.3: PSNR for images of the set CT3 for δ = 13.
Figure B.4: PSNR for images of the set CT4 for δ = 13.
Figure B.5: PSNR for images of the set CT5 for δ = 13.
Figure B.6: PSNR for images of the set MR1 for δ = 13.
Figure B.7: PSNR for images of the set MR2 for δ = 13.
Figure B.8: PSNR for images of the set MR3 for δ = 13.
Figure B.9: PSNR for images of the set MR4 for δ = 13.
Figure B.10: PSNR for images of the set MR5 for δ = 13.