Gambling and Data Compression

1 Gambling

1.1 Horse Race

Definition. The wealth relative $S(X) = b(X)\,o(X)$ is the factor by which the gambler's wealth grows if horse $X$ wins the race, where $b(X)$ is the fraction of the gambler's wealth invested in horse $X$ and $o(X)$ is the corresponding odds.

Definition. The doubling rate of a horse race is
\[ W(b, p) = E[\log S(X)] = \sum_{k=1}^{m} p_k \log b_k o_k . \]

Theorem. Let the race outcomes $X_1, X_2, \ldots$ be i.i.d. $\sim p(x)$. Then the wealth of the gambler using betting strategy $b$ grows exponentially at rate $W(b, p)$; that is,
\[ S_n \doteq 2^{n W(b, p)} . \]

Definition. The optimum doubling rate $W^*(p)$ is the maximum doubling rate over all choices of the portfolio $b$:
\[ W^*(p) = \max_{b} W(b, p) = \max_{b:\ b_i \ge 0,\ \sum_i b_i = 1} \ \sum_{i=1}^{m} p_i \log b_i o_i . \]

Theorem (Proportional gambling is log-optimal). The optimal doubling rate is given by
\[ W^*(p) = \sum_{i} p_i \log o_i - H(p) \]
and is achieved by the proportional gambling scheme $b^* = p$.

Theorem (Conservation theorem). For uniform fair odds,
\[ W^*(p) + H(p) = \log m . \]
Thus, the sum of the doubling rate and the entropy is a constant.

If the gambler does not always bet all the money, then the optimum strategy may depend on the odds and will not necessarily have the simple form of proportional gambling. There are three cases:

1. Fair odds with respect to some distribution: $\sum_i 1/o_i = 1$. By betting $b_i = 1/o_i$, one achieves $S(X) = 1$, which is the same as keeping some cash aside. Proportional betting is optimal.
2. Superfair odds: $\sum_i 1/o_i < 1$. By choosing $b_i = c/o_i$, where $c = 1/\sum_i (1/o_i)$, one has $S(X) = 1/\sum_i (1/o_i) > 1$ with probability 1. In this case, the gambler will always want to bet all the money, and the optimum strategy is again proportional betting.
3. Subfair odds: $\sum_i 1/o_i > 1$. Proportional gambling is no longer log-optimal. The gambler may want to bet only some of the money and keep the rest aside as cash, depending on the odds.

Based on Cover & Thomas, Chapters 5 and 6.
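
The proportional-gambling and conservation theorems of Section 1.1 can be checked numerically. The following Python sketch is only an illustration: the three-horse race, its probabilities `p`, its odds `o`, and the helper names `doubling_rate` and `entropy` are assumptions made for this example and do not come from the notes.

```python
import math

def doubling_rate(b, p, o):
    """W(b, p) = sum_k p_k * log2(b_k * o_k), in bits per race."""
    return sum(pk * math.log2(bk * ok) for pk, bk, ok in zip(p, b, o) if pk > 0)

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

# Hypothetical race: three horses, uniform fair odds (3-for-1 on each horse).
p = [0.5, 0.25, 0.25]
o = [3.0, 3.0, 3.0]

w_star = doubling_rate(p, p, o)                     # proportional betting, b = p
w_uniform = doubling_rate([1/3, 1/3, 1/3], p, o)    # equal bets on every horse

print("W(p, p)         =", w_star)
print("W(uniform, p)   =", w_uniform)
print("W*(p) + H(p)    =", w_star + entropy(p))
print("log2(m)         =", math.log2(len(p)))
```

Under these assumed numbers, proportional betting gives roughly 0.085 bits per race, while betting equally on every horse merely preserves wealth (doubling rate 0), and the sum of the optimal doubling rate and the entropy matches $\log_2 3$.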

1.2 Side Information and Entropy Rate

Definition. The increase $\Delta W$ in doubling rate is defined as
\[ \Delta W = W^*(X \mid Y) - W^*(X), \]
where
\[ W^*(X) = \max_{b(x)} \sum_{x} p(x) \log b(x)\,o(x), \qquad W^*(X \mid Y) = \max_{b(x \mid y)} \sum_{x, y} p(x, y) \log b(x \mid y)\,o(x). \]

Theorem. The increase $\Delta W$ in doubling rate due to side information $Y$ for a horse race $X$ is
\[ \Delta W = I(X; Y). \]

1.3 Dependent Horse Races and the Entropy Rate

If the horse races are dependent, suppose that the sequence of winning horses forms a stochastic process $\{X_k\}$. The optimal doubling rate for uniform fair odds ($m$-for-1) is
\[ W^*(X_k \mid X_{k-1}, X_{k-2}, \ldots, X_1) = \log m - H(X_k \mid X_{k-1}, X_{k-2}, \ldots, X_1), \]
which is achieved by
\[ b^*(x_k \mid x_{k-1}, \ldots, x_1) = p(x_k \mid x_{k-1}, \ldots, x_1). \]
The doubling rate then satisfies
\[ \frac{1}{n} E[\log S_n] = \log m - \frac{H(X_1, \ldots, X_n)}{n} . \]
Thus, in the limit as $n \to \infty$, the doubling rate is related to the entropy rate by
\[ \lim_{n \to \infty} \frac{1}{n} E[\log S_n] = \log m - H(\mathcal{X}). \]

2 Data Compression: Codes and Optimality

2.1 Definitions and Examples of Codes

Definition. A source code $C$ for a random variable $X$ is a mapping from $\mathcal{X}$, the range of $X$, to $\mathcal{D}^*$, the set of finite-length strings of symbols from a $D$-ary alphabet.

Definition. Let $C(x)$ denote the codeword corresponding to $x$ and let $l(x)$ denote its length. Then the expected length of a source code $C(x)$ for a random variable $X$ with pmf $p(x)$ is given by
\[ L(C) := E_p[l(X)] = \sum_{x \in \mathcal{X}} p(x)\,l(x). \]

Definition. A code is said to be nonsingular if every element of the range of $X$ maps into a different string in $\mathcal{D}^*$; that is,
\[ x \ne x' \ \Rightarrow\ C(x) \ne C(x'). \]
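
To make the definitions of expected length and nonsingularity concrete, here is a minimal Python sketch. The four-symbol source `pmf`, the two codes, and the function names are hypothetical choices for illustration only.

```python
def expected_length(code, pmf):
    """L(C) = sum_x p(x) * l(x), where l(x) is the length of codeword C(x)."""
    return sum(pmf[x] * len(code[x]) for x in pmf)

def is_nonsingular(code):
    """A code is nonsingular if distinct symbols receive distinct codewords."""
    return len(set(code.values())) == len(code)

# Hypothetical source and two binary codes for it.
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
singular = {"a": "0", "b": "0", "c": "1", "d": "1"}

print(expected_length(code, pmf), is_nonsingular(code))
print(expected_length(singular, pmf), is_nonsingular(singular))
```

The second code is shorter on average but useless for lossless description, since two different symbols share each codeword.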

Definition. The extension $C^*$ of a code $C$ is the mapping from finite-length strings of $\mathcal{X}$ to finite-length strings of $\mathcal{D}$, defined by
\[ C(x_1 x_2 \cdots x_n) = C(x_1) C(x_2) \cdots C(x_n), \]
where $C(x_1) C(x_2) \cdots C(x_n)$ indicates concatenation of the corresponding codewords.

Definition. A code is called uniquely decodable if its extension is nonsingular. Decoding such a code may require inspecting the entire string before the first codeword can be determined.

Definition. A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword. An instantaneous code is a self-punctuating code.

2.2 Instantaneous Codes and the Kraft Inequality

Theorem (Kraft inequality). For any instantaneous code (prefix code) over an alphabet of size $D$, the codeword lengths $l_1, l_2, \ldots, l_m$ must satisfy the inequality
\[ \sum_{i} D^{-l_i} \le 1 . \]
Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

Theorem (Extended Kraft inequality). For any countably infinite set of codewords that form a prefix code, the codeword lengths satisfy the extended Kraft inequality:
\[ \sum_{i=1}^{\infty} D^{-l_i} \le 1 . \]
Conversely, given any $l_1, l_2, \ldots$ satisfying the extended Kraft inequality, we can construct a prefix code with these codeword lengths.

3 Data Compression: Optimal Codes and Length Bounds

3.1 Optimal Codes

Definition. A probability distribution is called $D$-adic if each of the probabilities is equal to $D^{-n}$ for some $n$.

Theorem (Lower bound on codeword length). The expected length $L$ of any instantaneous $D$-ary code for a random variable $X$ is greater than or equal to the base-$D$ entropy $H_D(X)$:
\[ L \ge H_D(X), \]
with equality iff $D^{-l_i} = p_i$ for all $i$. Hence equality is attainable iff the distribution of $X$ is $D$-adic.
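
The converse half of the Kraft inequality is constructive: lengths that satisfy it can be turned into a prefix code by handing out tree nodes in order of increasing depth. The Python sketch below shows one common way to do this (a canonical assignment); it is not necessarily the construction used in the lectures, the function names are illustrative, and digits are written as single characters, so it assumes $D \le 10$.

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Exact value of sum_i D^(-l_i)."""
    return sum(Fraction(1, D ** l) for l in lengths)

def to_base(value, D, width):
    """Write a non-negative integer as `width` base-D digits (assumes D <= 10)."""
    digits = []
    for _ in range(width):
        value, r = divmod(value, D)
        digits.append(str(r))
    return "".join(reversed(digits))

def prefix_code_from_lengths(lengths, D=2):
    """Assign codewords in order of increasing length (canonical construction)."""
    if kraft_sum(lengths, D) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords, value, prev = [], 0, 0
    for l in sorted(lengths):
        value *= D ** (l - prev)   # descend to depth l in the D-ary tree
        codewords.append(to_base(value, D, l))
        value += 1                 # next unused node at this depth
        prev = l
    return codewords

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    return not any(i != j and b.startswith(a)
                   for i, a in enumerate(codewords)
                   for j, b in enumerate(codewords))

binary = prefix_code_from_lengths([3, 1, 3, 2])
ternary = prefix_code_from_lengths([2, 1, 2, 1, 2], D=3)
print(binary, is_prefix_free(binary))
print(ternary, is_prefix_free(ternary))
```

Codewords are returned sorted by length, not in the order the lengths were supplied. For the dyadic distribution $(1/2, 1/4, 1/8, 1/8)$ the binary lengths $(1, 2, 3, 3)$ satisfy $2^{-l_i} = p_i$, so the lower bound $L \ge H(X)$ is met with equality at 1.75 bits.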

3.2 Bounds on the Optimal Code Length

The previous theorem suggests finding the $D$-adic distribution vector $r$ closest to a given source distribution vector $p$, and then designing a code for $r$. By minimizing $D(p \| r)$ over $D$-adic $r$, we may exhibit a code (not necessarily optimal) whose length $L$ satisfies the following bound:

Theorem (Optimal expected codeword length). Let $l_1^*, l_2^*, \ldots, l_m^*$ be optimal codeword lengths for a source distribution $p$ and a $D$-ary alphabet, and let $L^*$ be the associated expected length of an optimal code ($L^* = \sum_i p_i l_i^*$). Then
\[ H_D(X) \le L^* < H_D(X) + 1 . \]

Consider sending a sequence of $n$ symbols drawn i.i.d. according to $p(x)$ in a block, so that we have a supersymbol from $\mathcal{X}^n$. Let $L_n$ be the expected codeword length per input symbol:
\[ L_n := \frac{1}{n} E[l(X_1, X_2, \ldots, X_n)] . \]
Then by letting the block length $n$ become large, we may achieve an expected length per symbol $L_n$ arbitrarily close to the entropy:

Theorem (Distributing the extra overhead bit). The minimum expected codeword length per symbol satisfies
\[ \frac{H(X_1, X_2, \ldots, X_n)}{n} \le L_n^* < \frac{H(X_1, X_2, \ldots, X_n)}{n} + \frac{1}{n} . \]
Moreover, if $X_1, X_2, \ldots, X_n$ is a stationary stochastic process, then
\[ L_n^* \to H(\mathcal{X}), \]
where $H(\mathcal{X})$ is the entropy rate of the process.

The previous theorem confirms that the entropy rate of a stationary stochastic process is indeed the minimum expected number of bits per symbol needed to describe the process.

If we design a code for the wrong input distribution, then the increase in expected description length is the relative entropy $D(p \| q)$, up to the usual one-bit rounding overhead:

Theorem (Wrong code). The expected length under $p(x)$ of the code assignment $l(x) = \lceil \log \frac{1}{q(x)} \rceil$ satisfies
\[ H(p) + D(p \| q) \le E_p[l(X)] < H(p) + D(p \| q) + 1 . \]
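
The wrong-code bound is easy to verify numerically. In the Python sketch below, the true distribution `p`, the mistaken design distribution `q`, and all function names are hypothetical values chosen for illustration; logarithms are base 2, so lengths are in bits.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def relative_entropy(p, q):
    """D(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_lengths(q):
    """Code assignment l(x) = ceil(log2(1 / q(x)))."""
    return [math.ceil(-math.log2(qi)) for qi in q]

p = [0.5, 0.25, 0.125, 0.125]   # true source distribution (hypothetical)
q = [0.4, 0.3, 0.2, 0.1]        # distribution the code was designed for (hypothetical)

lengths = shannon_lengths(q)
avg_len = sum(pi * li for pi, li in zip(p, lengths))   # E_p[l(X)]
lower = entropy(p) + relative_entropy(p, q)            # H(p) + D(p || q)

print("lengths          :", lengths)
print("E_p[l(X)]        :", avg_len)
print("H(p) + D(p||q)   :", lower)
print("bound satisfied  :", lower <= avg_len < lower + 1)
```

Setting `q = p` in this sketch recovers the ordinary bound $H(p) \le E_p[l(X)] < H(p) + 1$, since the relative entropy term vanishes.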

3.3 Kraft Inequality for Uniquely Decodable Codes

In the sense of expected length, the class of uniquely decodable codes, while larger, does not improve upon instantaneous codes:

Theorem (McMillan). The codeword lengths of any uniquely decodable $D$-ary code must satisfy the Kraft inequality:
\[ \sum_{i} D^{-l_i} \le 1 . \]
Conversely, given a set of codeword lengths satisfying this inequality, it is possible to construct a uniquely decodable code with these lengths.

Corollary. A uniquely decodable code for an infinite source alphabet $\mathcal{X}$ also satisfies the Kraft inequality.
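
As a small illustration of McMillan's theorem, the Python sketch below examines a hypothetical binary code that is uniquely decodable but not instantaneous (it is suffix-free, so it can be decoded from the end of the string). The codes and function names are assumptions for this example, and the brute-force check only tests concatenations of a few codewords, so it is a sanity check and not a proof of unique decodability.

```python
from fractions import Fraction
from itertools import product

def kraft_sum(codewords, D=2):
    """Exact value of sum_i D^(-l_i) for the given codewords."""
    return sum(Fraction(1, D ** len(w)) for w in codewords)

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    return not any(i != j and b.startswith(a)
                   for i, a in enumerate(codewords)
                   for j, b in enumerate(codewords))

def extension_is_nonsingular(codewords, max_symbols):
    """Check that all sequences of up to max_symbols codewords give distinct strings."""
    seen = set()
    for n in range(1, max_symbols + 1):
        for seq in product(codewords, repeat=n):
            s = "".join(seq)
            if s in seen:
                return False
            seen.add(s)
    return True

suffix_code = ["0", "01", "011", "111"]   # uniquely decodable, not prefix-free
bad_code = ["0", "1", "01"]               # Kraft sum exceeds 1

print(is_prefix_free(suffix_code), kraft_sum(suffix_code),
      extension_is_nonsingular(suffix_code, 4))
print(kraft_sum(bad_code), extension_is_nonsingular(bad_code, 2))
```

The first code meets the Kraft inequality with equality even though it is not a prefix code. The second violates the inequality, so by McMillan's theorem it cannot be uniquely decodable; indeed the string 01 has two parsings.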

3.4 Huffman Codes: Optimality and Examples

Consider the tree construction we used earlier in order to suggest a proof of the Kraft inequality for finite, instantaneous codes. It suggests a constructive procedure for assigning codewords in a manner such that their lengths are roughly inversely proportional to the corresponding symbol probabilities. We now formalize this idea through an example of Huffman codes:

Construct a $D$-ary tree from which codewords can be assigned. Build up the tree recursively by combining the $D$ lowest-probability symbols together at each stage.

A simple algorithm due to Huffman allows for the construction of optimal prefix codes for a given distribution.

Lemma (Existence of a particular optimal code). For any distribution, there exists an optimal instantaneous code (with minimum expected length) that satisfies the following properties:

1. The lengths are ordered inversely with the probabilities (i.e., if $p_j > p_k$, then $l_j \le l_k$).
2. The two longest codewords have the same length.
3. Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.

Theorem (Optimality of Huffman coding). Huffman coding is optimal; that is, if $C^*$ is a Huffman code and $C'$ is any other uniquely decodable code, then $L(C^*) \le L(C')$.

3.5 Shannon-Fano-Elias Coding

Fano proposed a suboptimal procedure based on recursively partitioning the unit interval, under the assumption that symbol probabilities are given in decreasing order. A related procedure, Shannon-Fano-Elias coding, makes direct use of the cumulative distribution function (cdf) $F(x)$ to assign codewords. By using the midpoint $\bar{F}(x)$ of each jump in the cdf, we may exhibit a prefix-free code $C$ satisfying:

- Codeword lengths $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$.
- Expected length $L(C) < H(X) + 2$.
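
Both constructions of Sections 3.4 and 3.5 fit in a few lines of Python. The sketch below is a minimal illustration under stated assumptions: it covers only the binary case ($D = 2$), it uses floating-point probabilities (exact for the dyadic Shannon-Fano-Elias example, but rounding could matter for other inputs), and the two source distributions and all function names are hypothetical.

```python
import heapq
import itertools
import math

def huffman_code(pmf):
    """Binary Huffman code for a dict {symbol: probability} with >= 2 symbols."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {x: ""}) for x, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # least likely subtree
        p1, _, c1 = heapq.heappop(heap)   # second least likely subtree
        merged = {x: "0" + w for x, w in c0.items()}
        merged.update({x: "1" + w for x, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

def sfe_code(pmf):
    """Shannon-Fano-Elias: first ceil(log2(1/p(x))) + 1 bits of the cdf midpoint."""
    code, F = {}, 0.0
    for x, p in pmf.items():
        frac = F + p / 2                       # midpoint of the jump at x
        length = math.ceil(-math.log2(p)) + 1
        bits = ""
        for _ in range(length):                # truncated binary expansion
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        code[x] = bits
        F += p                                 # cdf just after symbol x
    return code

def expected_length(code, pmf):
    return sum(pmf[x] * len(code[x]) for x in pmf)

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

pmf_h = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
huff = huffman_code(pmf_h)
pmf_s = {"a": 0.25, "b": 0.5, "c": 0.125, "d": 0.125}
sfe = sfe_code(pmf_s)

print(huff, expected_length(huff, pmf_h), entropy(pmf_h))
print(sfe, expected_length(sfe, pmf_s), entropy(pmf_s))
```

For the first source the Huffman codeword lengths come out as (2, 2, 2, 3, 3), with expected length between $H(X)$ and $H(X) + 1$. For the second source the Shannon-Fano-Elias codewords are longer than necessary, as expected from the weaker guarantee $L(C) < H(X) + 2$.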