IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 3, MAY 1998

The Compound Channel Capacity of a Class of Finite-State Channels

Amos Lapidoth, Member, IEEE, and İ. Emre Telatar, Member, IEEE

Abstract: A transmitter and receiver need to be designed to guarantee reliable communication on any channel belonging to a given family of finite-state channels defined over common finite input, output, and state alphabets. Both the transmitter and receiver are assumed to be ignorant of the channel over which transmission is carried out and also ignorant of its initial state. For this scenario we derive an expression for the highest achievable rate. As a special case we derive the compound channel capacity of a class of Gilbert–Elliott channels.

Index Terms: Compound channel, error exponents, finite-state channels, Gilbert–Elliott channel, universal decoding.

Manuscript received January 2, 1997; revised June 17, 1997. The work of A. Lapidoth was supported in part by the NSF Faculty Early Career Development (CAREER) Program. The material in this paper was presented in part at the International Conference on Combinatorics, Information Theory and Statistics, University of Southern Maine, Portland, ME, July 18–20, 1997. This research was conducted in part while A. Lapidoth was visiting Lucent Technologies, Bell Labs. A. Lapidoth is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA. İ. E. Telatar is with Lucent Technologies, Bell Labs, Murray Hill, NJ 07974 USA. Publisher Item Identifier S 0018-9448(98)02347-5.

I. INTRODUCTION

THIS paper deals with a situation where a communication system consisting of a transmitter and a receiver needs to be designed for an unknown channel. The only information available to the designer is the family of channels over which the system will be used, and based on this information the designer must design a transmitter (codebook) and a receiver (decoder) that will guarantee reliable communication over any channel in the family. No feedback mechanism is available to the transmitter, and the codebook must therefore be fixed before transmission begins. At the receiver end, the decoding rule must not depend on the channel over which communication takes place, since the receiver too is assumed ignorant of the channel. This situation is commonly referred to as coding for the compound channel [1] or coding for a class of channels [2], [3]. The highest rate at which robust reliable communication is achievable in this setup is called the compound channel capacity of the family, or simply the capacity of the family. It has been studied in the case where the family of channels consists of memoryless channels in [2], and in the case where the family consists of linear dispersive channels with an unknown distorting filter in [4]. In this paper we study the compound channel capacity for families of finite-state channels. Such channels exhibit memory and are often used to model wireless communications in the presence of fading [5]–[7]. Before turning to a precise statement of the problem and results, we explain some of the difficulties encountered in the compound channel by considering the case where the family of channels consists of memoryless channels only.
For this case the compound channel capacity is given by [1]–[3]

$$C = \max_{P} \inf_{\theta \in \Theta} I(P; W_\theta) \tag{1}$$

where $\Theta$ indexes the family of discrete memoryless channels under consideration (defined over common finite input and output alphabets), the maximization is over the set of input distributions, and $I(P; W_\theta)$ is the mutual information between the input and output of the channel $W_\theta$ when the input distribution to the channel is $P$. First note that the compound channel capacity is, in general, not equal to the infimum of the capacities of the different channels in the family, as the capacity-achieving input distributions may be different for different channels in the family. Once this is noticed, one soon realizes that the best one can hope for is to achieve (1). Notice, however, that in order to demonstrate that the rate of (1) is achievable, one must demonstrate that it can be achieved with codes such that neither the codebook nor the decoder depends on the channel being used. One cannot employ maximum-likelihood or joint-typicality decoding with respect to the law of the channel in use, as this law is unknown. Moreover, the classical random coding argument is based on computing the average probability of error for a random ensemble of codes and then inferring that there exists at least one code that performs as well as this average. The choice of the codebook from the ensemble typically depends on the channel, and one of the difficulties in proving (1) is in showing that there exists a codebook that is simultaneously good for all the channels in the family. Showing that one cannot guarantee reliable communication at any rate higher than (1) is usually simpler, but requires work as it does not follow directly from the single-channel converse theorem: the right-hand side of (1) may be smaller than the capacity of every channel in the family. To prove a converse one assumes that a good codebook of rate $R$ is given, and from it one then derives an input distribution $P$ that satisfies

$$R \le \inf_{\theta \in \Theta} I(P; W_\theta) + \epsilon \tag{2}$$

where $\epsilon$ is small and is related to the blocklength and to the probability of error attained by the code. The distribution $P$ is typically related to the empirical distribution of the codebook, and (2) is usually shown using Fano's inequality and some convexity arguments.
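To make (1) concrete, here is a small numerical sketch (ours, not from the paper; the helper names and the two-member channel family are illustrative) that approximates $\max_P \inf_\theta I(P; W_\theta)$ for a finite family of discrete memoryless channels by a grid search over input distributions:

```python
import numpy as np

def mutual_information(p, W):
    """I(P; W) in bits: p is the input distribution (|X|,), W[x, y] = W(y|x)."""
    q = p @ W                                   # induced output distribution
    safe = np.where(W > 0, W, 1.0)              # avoid log(0); zero terms drop out
    terms = np.where(W > 0, W * np.log2(safe / q), 0.0)
    return float(p @ terms.sum(axis=1))

def bsc(delta):
    """Binary-symmetric channel with crossover probability delta."""
    return np.array([[1 - delta, delta], [delta, 1 - delta]])

family = [bsc(0.05), bsc(0.2)]                  # a toy two-channel family

best_rate, best_p = -1.0, None
for a in np.linspace(0.0, 1.0, 1001):           # sweep binary input distributions
    p = np.array([a, 1.0 - a])
    worst = min(mutual_information(p, W) for W in family)
    if worst > best_rate:
        best_rate, best_p = worst, p

print(best_p, best_rate)                        # for BSCs the uniform input is optimal
```

For larger alphabets one would replace the grid search with a minimax variant of the Blahut–Arimoto algorithm, but the sweep suffices to illustrate the point of (1): the single input distribution must be fixed before the adversarial infimum over the family is taken.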

In this paper we shall derive a result analogous to (1) for finite-state channels. The result will be derived by demonstrating that for appropriate rates and maximum-likelihood decoding, the ensemble average probability of error for all the channels in the family can be uniformly bounded by an exponentially decreasing function of the blocklength, and by invoking a result from [8] on the existence of a universal decoder for the class of finite-state channels that performs asymptotically as well as a maximum-likelihood decoder tuned to the channel in use.

The rest of the paper is organized as follows. We conclude this introductory section with a more precise statement of the problem and with a presentation of our main result, the capacity of the compound channel consisting of finite-state channels. The coding theorem and the converse theorem needed to prove this result are presented in Section II. In Section III we use this result to derive the capacity of a class of Gilbert–Elliott channels. The paper concludes with a brief discussion in Section IV.

Precise Statement of the Problem and Main Result

For a given set $\mathcal{A}$, let $\mathcal{P}(\mathcal{A})$ denote the set of probability distributions on $\mathcal{A}$. We will only deal with probability distributions on finite sets, and so we will identify an element of $\mathcal{P}(\mathcal{A})$ with a nonnegative function $P$ on $\mathcal{A}$ such that $\sum_{a \in \mathcal{A}} P(a) = 1$. For a positive integer $n$, denote by $\mathcal{P}_n(\mathcal{A})$ the set of probability distributions $P$ on $\mathcal{A}$ with the property that $nP(a)$ is an integer for all $a \in \mathcal{A}$. We will call such a $P$ an $n$-type on $\mathcal{A}$.

A discrete finite-state channel with input alphabet $\mathcal{X}$, output alphabet $\mathcal{Y}$, and state space $\mathcal{S}$ is characterized by a conditional probability assignment $P(y, s' \mid x, s)$. Operationally, if at time $k$ the state of the channel is $s_{k-1}$ and the input to the channel at time $k$ is $x_k$, then the output $y_k$ of the channel at time $k$ and the state $s_k$ of the channel at time $k$ are determined according to the distribution $P(y_k, s_k \mid x_k, s_{k-1})$. For such a channel, the probability that the channel output is $y^n = (y_1, \ldots, y_n)$ and the final channel state is $s_n$, conditional on the initial state $s_0$ and the channel input $x^n = (x_1, \ldots, x_n)$, is given by

$$P(y^n, s_n \mid x^n, s_0) = \sum_{s_1, \ldots, s_{n-1}} \prod_{k=1}^{n} P(y_k, s_k \mid x_k, s_{k-1}). \tag{3}$$

We can sum this probability over $s_n$ to obtain the probability that the channel output is $y^n$ conditional on the initial state $s_0$ and the channel input $x^n$:

$$P(y^n \mid x^n, s_0) = \sum_{s_n \in \mathcal{S}} P(y^n, s_n \mid x^n, s_0). \tag{4}$$

Given an initial state $s_0$ and a distribution $Q$ on $\mathcal{X}^n$, the joint distribution of the channel input and output is well-defined, and the mutual information between the input and the output is given by $I(Q; P(\cdot \mid \cdot, s_0))$; we are abusing the notation to let $I(\cdot\,;\cdot)$ take as arguments both the random variables and the distributions.

Suppose now that we are given a class $\{P_\theta : \theta \in \Theta\}$ of discrete finite-state channels with a common finite state space $\mathcal{S}$ and common finite input and output alphabets $\mathcal{X}$ and $\mathcal{Y}$. Each channel $\theta$ is characterized by $P_\theta(y, s' \mid x, s)$, and in analogy to (3) and (4) we will denote by $P_\theta(y^n, s_n \mid x^n, s_0)$ the probability that the output of channel $\theta$ is $y^n$ and the final state is $s_n$, conditional on the input $x^n$ and the initial state $s_0$, and by $P_\theta(y^n \mid x^n, s_0)$ the probability that the output of channel $\theta$ is $y^n$ under the same conditioning. Given a channel $\theta$, an initial state $s_0$, and a distribution $Q$ on $\mathcal{X}^n$, the mutual information between the input and output of the channel is given by

$$I(Q; P_\theta(\cdot \mid \cdot, s_0)). \tag{5}$$

Definition 1: A rate $R$ is said to be achievable for the family of channels if for any $\epsilon > 0$ there exists an $n_0$ such that for all $n \ge n_0$ there exist an encoder $f: \{1, \ldots, 2^{nR}\} \to \mathcal{X}^n$ and a decoder $\phi: \mathcal{Y}^n \to \{1, \ldots, 2^{nR}\}$ such that the average probability of error is less than $\epsilon$, irrespective of the initial state and of the channel over which the transmission is carried out.
That is,

$$\frac{1}{2^{nR}} \sum_{m=1}^{2^{nR}} P_\theta\left( \phi(Y^n) \ne m \mid X^n = f(m), S_0 = s_0 \right) < \epsilon \quad \text{for all } \theta \in \Theta \text{ and } s_0 \in \mathcal{S}.$$

Since the state is not observed by the encoder or the decoder, we can assume that the channels in the class have a common state space as long as each channel in the class has the same number of states.
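The law (4) can be computed without enumerating state sequences by pushing the sums in (3) through time, one step at a time. The following sketch is our illustration (the array layout and the function name are our assumptions, not the paper's notation):

```python
import numpy as np

def output_prob(P, x, y, s0):
    """Compute P(y^n | x^n, s0) for a finite-state channel.

    P has shape (|X|, |S|, |Y|, |S|) with P[x, s, y, s2] = P(y, s2 | x, s).
    x and y are equal-length integer sequences; s0 is the initial state.
    """
    alpha = np.zeros(P.shape[1])            # alpha[s] = P(y^k, S_k = s | x^k, s0)
    alpha[s0] = 1.0
    for xk, yk in zip(x, y):
        alpha = alpha @ P[xk, :, yk, :]     # one step of the sum in (3)
    return float(alpha.sum())               # summing over the final state gives (4)
```

The same recursion is what a maximum-likelihood decoder for a known finite-state channel would evaluate for each codeword before picking the largest value.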

When we are presented with a class of finite-state channels with common input and output alphabets where each channel has at most $K$ but possibly fewer than $K$ states, we can equip the class with a state space of $K$ states that is common to all the channels in the family by augmenting the state spaces of those channels that have fewer than $K$ states. However, one has to be careful not to artificially introduce bad states in this process. An approach that will avoid this is the following. If a channel has fewer than $K$ states, pick one of its states, say $s^*$, and define a new channel with $K$ states in which each added state behaves exactly like $s^*$. The new channel and the old channel then assign identical conditional probabilities to every output sequence, and thus a code that is good for one is good for the other.

Let $C$ denote the compound channel capacity of the class, i.e., the supremum of all achievable rates for the class. We will prove the following theorem.

Theorem 1: The compound channel capacity of the class $\{P_\theta : \theta \in \Theta\}$ of finite-state channels defined over common finite input, output, and state alphabets is given by

$$C = \lim_{n \to \infty} \max_{Q \in \mathcal{P}(\mathcal{X}^n)} \inf_{\theta \in \Theta} \min_{s_0 \in \mathcal{S}} \frac{1}{n} I(Q; P_\theta(\cdot \mid \cdot, s_0)). \tag{6}$$

Analytic calculation of this limit is possible only in special cases, e.g., for a class of Gilbert–Elliott channels which will be discussed below, and, in general, the limit cannot even be computed numerically. Nonetheless, in the course of the paper we will establish a sequence of lower bounds monotonically increasing to the limit (Proposition 1).

A relatively simple finite-state channel that is often used to model communication in the presence of fading is the Gilbert–Elliott channel, which has been studied in [6], [11], and in references therein. In the Gilbert–Elliott channel the channel input and output alphabets are binary, and the channel state is also binary; for convenience we label the states $B$ and $G$, corresponding to a bad state and a good state. In this model the channel output sequence is related to the channel input sequence by $y_k = x_k \oplus z_k$, where $\oplus$ denotes binary (mod-2) addition and $z_k$ is the realization of a binary hidden Markov noise process with internal state set $\{B, G\}$; see Section III. The capacity of the Gilbert–Elliott channel is achieved by an independent identically distributed (i.i.d.) Bernoulli-1/2 input distribution, irrespective of the channel parameters [6]. Using Theorem 1 we shall show in Section III that under relatively mild conditions, outlined in Theorem 2, the compound channel capacity of a class of Gilbert–Elliott channels is equal to the infimum of the capacities of the members of the family. The highest rate at which reliable communication can be guaranteed is thus not reduced by the ignorance of the transmitter and receiver of the channel in use.

II. PROOF OF THE CONVERSE AND A CODING THEOREM

Before we can even start to prove the theorem, we need to show that the right-hand side of (6) exists. This is a consequence of the following proposition, which is proved in Appendix I.

Proposition 1: The maximum over $Q$ in the definition of the sequence

$$C_n = \max_{Q \in \mathcal{P}(\mathcal{X}^n)} \inf_{\theta \in \Theta} \min_{s_0 \in \mathcal{S}} \frac{1}{n} I(Q; P_\theta(\cdot \mid \cdot, s_0)) \tag{7}$$

is well-defined, and the sequence converges. Moreover,

$$\lim_{n \to \infty} C_n = \sup_{n} \left( C_n - \frac{\log |\mathcal{S}|}{n} \right)$$

so that $C_n - (\log |\mathcal{S}|)/n$ provides a sequence of lower bounds increasing to the limit.

A. Converse

Given a code of blocklength $n$ and rate $R$ whose error probability does not exceed $\epsilon$ for any channel and any initial state, define $Q \in \mathcal{P}(\mathcal{X}^n)$ to be uniform over the codewords and zero otherwise. Then by Fano's inequality we get, for all $\theta$ and $s_0$,

$$R(1 - \epsilon) - \frac{1}{n} \le \frac{1}{n} I(Q; P_\theta(\cdot \mid \cdot, s_0)) \tag{8}$$

and thus, by (7),

$$R(1 - \epsilon) - \frac{1}{n} \le C_n \tag{9}$$

$$C_n \le \lim_{m \to \infty} C_m + \frac{\log |\mathcal{S}|}{n} \tag{10}$$

where the first inequality follows from (8) and the definition (7), and the second follows from Proposition 1. Letting $n$ tend to infinity and $\epsilon$ to zero, we conclude that no rate above the right-hand side of (6) is achievable.

B. Coding Theorem

We will prove the coding theorem in a sequence of steps.
The first step is to quote a result from [5] to show that if the rate is less than the right-hand side of (6) and if we employ maximum-likelihood decoding tuned to the channel over which the transmission is carried out, then we can find an $n$ such that if we view the channel inputs and outputs in blocks of $n$ (that is, consider an equivalent channel with input alphabet $\mathcal{X}^n$ and output alphabet $\mathcal{Y}^n$) and construct i.i.d. random codes for this channel (that is, codes where each symbol of each codeword is chosen independently according to some distribution), then we achieve exponentially decaying probability of error in increasing blocklength. The decay of the probability of error depends, of course, on the channel in use. The second step is to show that the resulting error exponent is bounded from below uniformly over the class. The third step is to convert this i.i.d. random coding result to a different kind of random coding, where the codewords are chosen independently but uniformly over some set of input sequences. The last step is to invoke a theorem of [8] on the existence

of a universal decoder for the class of finite-state channels to show that this suffices to construct good codes and decoders (that do not require knowledge of the channel in use) for the compound channel.

Given a channel $\theta$ and an initial state $s_0$, let $\bar{P}_e(N, R, Q, \theta, s_0)$ denote the average (over codebooks and messages) probability of error that is incurred when a blocklength-$N$, rate-$R$ code whose codewords are drawn independently according to the distribution $Q$ is used over the channel $\theta$ and is decoded using a maximum-likelihood decoder tuned to the channel. We know [5, pp. 176–182] that this error probability is bounded by an exponentially decaying function of the blocklength, with a Gallager-type exponent determined by a parameter $\rho \in [0, 1]$, the input distribution, and the channel.

Lemma 1: Given $Q \in \mathcal{P}(\mathcal{X}^n)$, let $Q^{(m)} \in \mathcal{P}(\mathcal{X}^{nm})$ be defined by the $m$-fold product of $Q$. Then the random coding exponent attained by $Q^{(m)}$ over blocks of length $nm$ is bounded from below in terms of the exponent attained by $Q$ over blocks of length $n$.

Proof: Identical to [5, pp. 179–180, proof of Lemma 5.9.1].

Lemma 2: For any channel with given input and output alphabets, and for any distribution on the input, the Gallager exponent is lower-bounded by $\rho$ times the mutual information minus a term quadratic in $\rho$ whose coefficient depends only on the size of the output alphabet.

Proof: See Appendix II.

Notice that the bound of Lemma 2 vanishes at $\rho = 0$ and is positive for sufficiently small $\rho > 0$ whenever the rate is below the mutual information. Given a rate $R$ below the right-hand side of (6), choose $n$ and a distribution $Q \in \mathcal{P}(\mathcal{X}^n)$ achieving the maximum in (7) at a value exceeding $R$. From Lemma 1, applying Lemma 2 to the channel of blocks of length $n$ and choosing $\rho$ to maximize the right-hand side in the range $[0, 1]$, we conclude that for all $\theta$ and $s_0$ the average probability of error is bounded by an exponentially decaying function of the blocklength (11). The important thing to note is that the right-hand side of (11) is independent of both $\theta$ and $s_0$.

We have thus seen that if one constructs a random code of blocklength $nm$ by choosing codewords independently according to $Q^{(m)}$, then the probability of error decays exponentially in the blocklength as long as the rate of the code is below the maximum in (7). We now show that this implies that random coding by choosing codewords independently and uniformly over a type class has similar performance:

Lemma 3: Given $Q \in \mathcal{P}(\mathcal{X}^n)$, let $Q^{(m)}$ denote the distribution that is the $m$-fold product of $Q$. For a given type $P$, let $U_P$ denote the distribution that is uniform over the length-$nm$ sequences of type $P$. For every distribution $Q$ there exists a type $P$, whose choice depends on $Q$ but not on $\theta$, such that random coding with codewords drawn according to $U_P$ attains an average probability of error that exceeds the one attained with codewords drawn according to $Q^{(m)}$ by at most a factor that is subexponential in the blocklength.
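Drawing codewords independently and uniformly over a type class, the random coding ensemble of Lemma 3, amounts to independently permuting a fixed multiset of symbols. A minimal sketch (ours; the example type is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_type_class_codebook(counts, num_codewords):
    """Each codeword is a uniformly random arrangement of a fixed multiset:
    counts[a] copies of symbol a, so every codeword has exactly the same type."""
    base = np.repeat(np.arange(len(counts)), counts)
    return np.stack([rng.permutation(base) for _ in range(num_codewords)])

# Blocklength-8 binary codewords of type (1/2, 1/2): four 0's and four 1's each.
print(sample_type_class_codebook([4, 4], num_codewords=3))
```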

Proof: Given a code $\mathcal{C}$ of rate $R$ chosen according to the product distribution $Q^{(m)}$, consider the following procedure to construct an equitype subcode $\mathcal{C}'$ of rate $R'$: find the type with the highest occurrence in $\mathcal{C}$ (resolving ties according to some arbitrary but fixed rule). The number of codewords of this type is lower-bounded by the size of $\mathcal{C}$ divided by the number of types, since the number of types is upper-bounded polynomially in the blocklength. Construct the code $\mathcal{C}'$ by picking the first codewords in $\mathcal{C}$ of this type. Since $\mathcal{C}'$ is a subcode of $\mathcal{C}$, its average probability of error when used over the channel is upper-bounded by that of $\mathcal{C}$ times the ratio of the two code sizes. Since $\mathcal{C}$ is a random code, the type $P$ is also random, with a distribution that depends on $Q$ and the blocklength but not on $\theta$. Also, conditional on $P$, the codewords in $\mathcal{C}'$ are mutually independent and uniformly distributed over the set of sequences of length $nm$ and type $P$. Denoting the distribution of the type by $\mu$, we choose $P$ to be a type to which $\mu$ assigns probability at least the reciprocal of the number of types; again, this is possible since the number of types is upper-bounded polynomially in the blocklength. The claim of the lemma then follows.

Combining this lemma with (11), we see that for each $n$ there is a type $P$ such that codewords drawn independently and uniformly over the sequences of type $P$ attain, for every channel in the family and every initial state, an exponentially decaying average probability of error under maximum-likelihood decoding. This approach can be easily extended to treat the case where the blocklength is of the form $nm + j$, $0 < j < n$, by appending a constant string of length $j$ to each codeword in the codebook designed for the blocklength $nm$. One must, of course, guarantee that $j$ is negligible compared to $nm$. Then, for all $\theta$ and $s_0$, the average probability of error under maximum-likelihood decoding decays exponentially in the blocklength (12).

We will now invoke the following theorem, proved in [8], to show that (12) implies the existence of a code and a decoder (neither depending on $\theta$ or $s_0$) that perform well for every channel in the family.

Theorem [8]: Given an input alphabet, an output alphabet, and a finite state space, let $\Theta$ index the class of all finite-state channels with these input, output, and state alphabets. Consider now a random codebook of rate $R$ and blocklength $N$ whose codewords are drawn independently and uniformly over a set $\mathcal{T} \subseteq \mathcal{X}^N$. Let $\bar{P}_e^{\mathrm{ML}}(\theta, s_0)$ denote the average (over messages and codebooks) probability of error incurred over the channel $\theta$ with initial state $s_0$ when maximum-likelihood decoding is performed (with the knowledge of $\theta$ at the decoder). Similarly, for any decoder and codebook, let $\bar{P}_e(\theta, s_0)$ denote the average (over messages) probability of error incurred over the channel $\theta$ with initial state $s_0$ when that codebook and decoder are used. There then exists a sequence of rate-$R$ codes and a sequence of decoders (both sequences not depending on the specific channel $\theta$ or initial state $s_0$) whose probability of error exceeds the corresponding maximum-likelihood probability of error by at most a subexponential factor.

For a detailed description of the structure of the universal decoder and of the family of codes of this theorem, the reader is referred to [8]. Loosely speaking, the decoder is constructed by merging the maximum-likelihood decoders that correspond to each member of a finite set of finite-state channels. To within a factor which is no bigger than the cardinality of the set, the merged decoder is shown to have a probability of error that is no worse than the probability of error incurred by any of the maximum-likelihood decoders. The set is chosen so as to have a polynomial cardinality (in the blocklength) and to be dense, in the sense that every finite-state channel is within a small distance of some channel in the set, where the notion of distance between channel laws is made precise in [8]. We apply this theorem with $\mathcal{T}$ the set of sequences of type $P$, to conclude that for rates below the right-hand side of (6) there exists a sequence of codes and a sequence of decoders, neither depending on the channel in use or its initial state, such that the probability of error decays exponentially in the blocklength for any channel in the family.

III. A CLASS OF GILBERT–ELLIOTT CHANNELS

In this section we study the compound channel capacity of a class of Gilbert–Elliott channels. Such channels have two internal states and binary input and output alphabets.
The channel law of a Gilbert–Elliott channel is determined by four parameters, $P_G$, $P_B$, $g$, and $b$, which define the channel law through (3): in the good state $G$ the channel behaves as a binary-symmetric channel with crossover probability $P_G$, in the bad state $B$ it behaves as a binary-symmetric channel with crossover probability $P_B$, and the state evolves as a Markov chain, moving from $B$ to $G$ with probability $g$ and from $G$ to $B$ with probability $b$, independently of the channel input; see Fig. 1.
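A Gilbert–Elliott channel is easy to simulate directly from this description; the sketch below (ours; the parameter values are illustrative) generates the hidden Markov noise and applies it to an input sequence:

```python
import numpy as np

rng = np.random.default_rng(1)

def gilbert_elliott_noise(n, p_g, p_b, g, b, state=0):
    """Sample n noise bits z_k; states: 0 = bad (error prob p_b), 1 = good (p_g).
    g = P(bad -> good) and b = P(good -> bad) per time step."""
    z = np.empty(n, dtype=int)
    for k in range(n):
        z[k] = int(rng.random() < (p_g if state == 1 else p_b))
        if state == 0:
            state = 1 if rng.random() < g else 0   # bad -> good w.p. g
        else:
            state = 0 if rng.random() < b else 1   # good -> bad w.p. b
    return z

x = rng.integers(0, 2, size=16)                    # channel input bits
y = x ^ gilbert_elliott_noise(16, p_g=0.01, p_b=0.4, g=0.1, b=0.05)
```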

Fig. 1. The Gilbert–Elliott channel model. $P_G$ and $P_B$ are the channel error probabilities in the good and bad states, and $g$ and $b$ are the transition probabilities between the states.

It is straightforward to verify that the noise sequence is a binary hidden Markov process, as follows. Let $\{S_k\}$ denote a stationary Markov process taking values in $\{B, G\}$ with the law determined by the transition probabilities $g$ and $b$. Let $\{Z_k\}$ be a $\{0, 1\}$-valued random process where, conditional on the process $\{S_k\}$, the $Z_k$ form an independent sequence of random variables with

$$\Pr(Z_k = 1 \mid S_k) = \begin{cases} P_B & \text{if } S_k = B \\ P_G & \text{if } S_k = G. \end{cases}$$

If $\{Z_k\}$ is a sequence of random variables, then $Z^k$ denotes $(Z_1, \ldots, Z_k)$. The capacity of a single Gilbert–Elliott channel is derived in [6] and is achieved by a memoryless Bernoulli-1/2 input distribution, irrespective of the channel parameters. It can be expressed in the following form: the capacity of the Gilbert–Elliott channel of parameters $P_G$, $P_B$, $g$, $b$ is given by

$$C(P_G, P_B, g, b) = 1 - \mathcal{H}(\mathbf{Z})$$

where $\mathcal{H}(\mathbf{Z}) = \lim_{k \to \infty} H(Z_k \mid Z^{k-1})$ is the entropy rate of the process $\{Z_k\}$. Using Theorem 1 we shall now establish the compound channel capacity of a class of Gilbert–Elliott channels.

Theorem 2: Consider a family of Gilbert–Elliott channels with parameters $\{(P_G(\alpha), P_B(\alpha), g(\alpha), b(\alpha)) : \alpha \in A\}$, and let

$$\delta = \inf_{\alpha \in A} \min\{ g(\alpha),\, b(\alpha),\, 1 - g(\alpha),\, 1 - b(\alpha) \}. \tag{13}$$

If $\delta > 0$, then the compound channel capacity of the family is given by

$$\inf_{\alpha \in A} C(P_G(\alpha), P_B(\alpha), g(\alpha), b(\alpha))$$

where $C(P_G, P_B, g, b)$ is the capacity of the Gilbert–Elliott channel with parameters $P_G$, $P_B$, $g$, $b$.

Proof: First note that the converse part of this theorem is trivial, since no rate is achievable for a family of channels if it is unachievable for some member of the family. We thus focus on the direct part, which we prove using Theorem 1 by choosing the input distribution to be i.i.d. Bernoulli-1/2. Noting that for this input distribution the channel output is also i.i.d. Bernoulli-1/2, we deduce that the noise $Z_k = X_k \oplus Y_k$ (where $\oplus$ denotes mod-2 addition) is independent of the input. Thus the noise is independent of the inputs, and is Bernoulli($P_B$) distributed if the channel is in the bad state and Bernoulli($P_G$) distributed if the channel is in the good state. The theorem will thus be established once we show that the normalized mutual information induced by this input distribution converges to $1 - \mathcal{H}(\mathbf{Z})$, uniformly over the family and over the initial states (14).

Given a channel, an initial state, and the channel input, we have by (3) that the probability of the channel state being $s$ at time $k$ is well-defined, and we can extend the notion of indecomposability [5, p. 105] to families of channels as follows.

Definition 2: A family of channels defined over common finite input, output, and state alphabets is uniformly indecomposable if for any $\epsilon > 0$ there exists an $n_0$ such that for $n > n_0$

$$\left| P_\theta(S_n = s \mid x^n, s_0) - P_\theta(S_n = s \mid x^n, s_0') \right| \le \epsilon$$

for all $s$, all input sequences $x^n$, all channels in the family, and all initial states $s_0$ and $s_0'$.

Extending [5, Theorem 4.6.4] to uniformly indecomposable families of channels, we obtain that for such families, and for any input distribution, the normalized mutual information converges to its limit uniformly over the family and the initial states (15). Under the assumptions of the theorem, the family of Gilbert–Elliott channels is uniformly indecomposable, and it thus follows that (14), and hence the theorem, will be established if we show (16), the corresponding uniform convergence of the normalized noise entropies.
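The expression $1 - \mathcal{H}(\mathbf{Z})$ lends itself to numerical evaluation: by the Shannon–McMillan–Breiman theorem, $-\frac{1}{n}\log_2 P(Z^n)$ converges to the entropy rate, and $P(Z^n)$ is computable exactly with the hidden-Markov forward recursion. The following Monte Carlo sketch is ours, not the paper's (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def ge_capacity_estimate(p_g, p_b, g, b, n=100_000):
    """Estimate 1 - H(Z) via -(1/n) log2 P(z^n); states: 0 = bad, 1 = good."""
    T = np.array([[1 - g, g], [b, 1 - b]])        # state transition matrix
    e = np.array([p_b, p_g])                      # P(Z = 1 | state)
    pi = np.array([b, g]) / (b + g)               # stationary state distribution
    # Sample one long noise realization starting from the stationary chain.
    s, z = rng.choice(2, p=pi), np.empty(n, dtype=int)
    for k in range(n):
        z[k] = int(rng.random() < e[s])
        s = rng.choice(2, p=T[s])
    # Forward recursion accumulating log2 P(z_k | z^{k-1}).
    alpha, log_prob = pi.copy(), 0.0               # alpha[s] = P(S_k = s | z^{k-1})
    for zk in z:
        emit = e if zk else 1 - e                  # P(z_k | state)
        lik = float(alpha @ emit)                  # P(z_k | z^{k-1})
        log_prob += np.log2(lik)
        alpha = (alpha * emit) @ T / lik           # posterior, then one-step predict
    return 1.0 + log_prob / n

print(ge_capacity_estimate(p_g=0.01, p_b=0.3, g=0.1, b=0.05))
```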

We now show that to establish (16) it suffices to show (17), the uniform convergence of the conditional noise entropies given the initial state. This reduction follows by noting that the conditional entropy of the noise given its past is monotonically nonincreasing in the blocklength, and by noting that, by the chain rule, the normalized entropy of $Z^n$ is the average of these conditional entropies (18). Next, recalling that given the current state the future of the noise process is independent of its past, we conclude that (19) holds; on the other hand, by (18) we also have (20). Thus, by (19) and (20), it follows that to prove (17) (and hence the theorem) it suffices to show (21). In words, we need to prove that given the past noise samples, knowing the initial state becomes immaterial to estimating the current noise sample for sufficiently large times, with the convergence being uniform over the family.

Let $\nu_k$ be the distribution of the channel state at time $k$ after observing the noise process $Z^{k-1}$. Let $\nu_k^{\mathrm{st}}$ denote $\nu_k$ when the initial state is chosen according to the stationary distribution of the state chain, let $\nu_k^{B}$ denote $\nu_k$ when the initial state is $B$, and let $\nu_k^{G}$ denote $\nu_k$ when the initial state is $G$. To prove (21), write the difference of the conditional entropies in terms of the binary entropy function $h(\cdot)$ evaluated at the error probabilities predicted by $\nu_k^{\mathrm{st}}$ and by $\nu_k^{B}$ (similarly for $\nu_k^{G}$), where the expectation is taken over $Z^{k-1}$ when the initial state has its stationary distribution. It follows that to prove (21) it suffices to prove that these state estimates approach one another (22), which is proved in Appendix III.

A different proof of (22), which is valid under an additional condition, is outlined below. It has the advantage of being generalizable to finite-state channels with more states. The proof is based on the recursion equations for computing $\nu_{k+1}$ from $\nu_k$; see [7, eq. (11)], [9, Lemma 2.1], and references therein. Given an initial distribution on the channel state, the conditional probability of the observed noise can be written as a normalized product of matrices, each of the form of a diagonal matrix multiplied by a stochastic matrix, applied to a column all-one vector (23). The convergence in (22) then follows from [15, Proposition A.4] on normalized products of matrices of this form. This proposition improves on the estimates of [9, Lemma 6.1] on normalized products of matrices by exploiting the structure of the matrices.

It should be noted that irrespective of the transition parameters, and irrespective of the initial state, we have that if the input distribution is i.i.d. Bernoulli-1/2 then the normalized mutual information is bounded from below in terms of the crossover probabilities alone. This observation could be used to slightly strengthen Theorem 2. In particular, if we wish to demonstrate that a rate $R$ is achievable for the family, then in computing the infimum in (13) we may exclude those channels in the family for which this lower bound already exceeds $R$.

To see that some condition on the transition probabilities of the state chain is necessary, consider a class of Gilbert–Elliott channels indexed by the positive integers, in which the bad state of every channel is completely noisy (i.e., $P_B = 1/2$, so that in the bad state the output is independent of the input) and the transition probabilities $g_k$ and $b_k$ of channel $k$ tend to zero as $k$ grows. For any given $k$, one can achieve positive rates over the channel $k$ by using a deep enough interleaver to make the channel look like a memoryless binary-symmetric channel whose crossover probability is the stationary average of the two crossover probabilities. Thus the infimum of the capacities of the members of this family is strictly positive. However, for any given blocklength $n$, the channel that corresponds to $k$ large compared to $n$, when started in the bad state, will stay in the bad state for the duration of the transmission with probability exceeding $(1 - b_k)^n$, which is close to one. Since in the bad state the channel output is independent of the input, we conclude that reliable communication is not possible at any positive rate for the whole family.
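To quantify how persistent the bad state is in this counterexample (our arithmetic, with $b_k$ the bad-to-good transition probability of the $k$th channel):

$$\Pr\left[\, S_1 = \cdots = S_n = B \mid S_0 = B \,\right] = (1 - b_k)^n \ge 1 - n b_k \longrightarrow 1 \qquad (b_k \to 0)$$

so for every fixed blocklength $n$ some member of the family is, with probability nearly one, a channel whose output is independent of its input throughout the entire transmission.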

IV. DISCUSSION AND CONCLUSIONS

In this paper we have derived the capacity of a class of finite-state channels. Comparing (6) with (1), we note that the expression for the capacity is very similar to the one that holds for a class of memoryless channels, as both take the form of a maximum of an infimum of mutual informations, where the maximum is over all input distributions and the infimum is over the channels in the family. The main difference is that in the expression for the capacity of a class of finite-state channels there is an additional limit that allows the dimension of the input distribution to go to infinity. This is not surprising, as such a limit is needed even to express the capacity of a single finite-state channel. This additional limit is significant because, if the convergence to the mutual information rate is not uniform over the family, unexpected compound channel capacities may result. Theorem 1 should not lead one to conjecture that the compound channel capacity of any family is of a max-inf form, as the following counterexample demonstrates [12], [13]. Let $\theta \in [0, 1)$ and let $\theta_1, \theta_2, \ldots$ be the digits in the binary expansion of $\theta$, i.e.,

$$\theta = \sum_{k=1}^{\infty} \theta_k 2^{-k}.$$

(For definiteness, for a rational $\theta$ of the form $j 2^{-m}$, take the terminating expansion among the two possible binary expansions.) Consider now the set of channels defined over common binary input and output alphabets, where over the channel indexed by $\theta$ the conditional probability of the output sequence given the input sequence is

$$P_\theta(y^n \mid x^n) = \begin{cases} 1 & \text{if } y_k = x_k \oplus \theta_k \text{ for } k = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$

where $\oplus$ denotes mod-2 addition. For this family of channels the max-inf expression yields a value of one bit per channel use irrespective of the blocklength, yet the compound channel capacity of this family is zero, as can be easily verified using standard techniques from the theory of arbitrarily varying channels; see [13], [14, Appendix]. This example can be explained by noting that for this family of channels there does not exist a universal decoder [8].

APPENDIX I

We prove Proposition 1 using the following sequence of lemmas.

Lemma 4: For each $n$, there exists a $Q \in \mathcal{P}(\mathcal{X}^n)$ that attains the maximum in (7).

Proof: For each $\theta$, $s_0$, and $n$, the function $Q \mapsto I(Q; P_\theta(\cdot \mid \cdot, s_0))$ is continuous. Furthermore, for a fixed $n$, this class of functions is uniformly continuous in $Q$. (The modulus of continuity can be bounded in terms of the sizes of the input and output alphabets alone; see, e.g., [2, Lemma 2].) Thus the infimum of these functions over the family and the initial states is continuous, and since $\mathcal{P}(\mathcal{X}^n)$ is compact, the supremum over $Q$ of this function is attained.

Lemma 5: Given blocklengths $n$ and $m$, distributions $Q_n \in \mathcal{P}(\mathcal{X}^n)$ and $Q_m \in \mathcal{P}(\mathcal{X}^m)$, and the product distribution formed from them as in Lemma 1, the worst-case mutual information attained by the product over blocklength $n + m$ is lower-bounded by the sum of the worst-case mutual informations attained by $Q_n$ and $Q_m$, up to a correction of $\log |\mathcal{S}|$.

Proof: For a given $\theta$ and $s_0$, let the channel input over the $n + m$ uses be distributed according to the product distribution. We want to show that the resulting mutual information is lower-bounded by the sum of the two shorter-block mutual informations, up to the stated correction. Splitting the mutual information into the contribution of the first $n$ uses and that of the last $m$ uses, the first term is lower-bounded by the worst-case mutual information over blocks of length $n$. For the second term, note that we have used the independence of the two blocks under the product distribution, and that conditioning on the channel state at time $n$ reduces the mutual information by at most $\log |\mathcal{S}|$

(this follows from the inequality $H(S_n) \le \log |\mathcal{S}|$; see [5, p. 112]). Combining the above, we obtain (24).

Corollary 1: The sequence $C_n$ defined in (7) converges. Furthermore,

$$\lim_{n \to \infty} C_n = \sup_{n} \left( C_n - \frac{\log |\mathcal{S}|}{n} \right). \tag{25}$$

Proof: Given integers $n$ and $m$, let $Q_n$ and $Q_m$ achieve the maximizations in (7) for blocklengths $n$ and $m$, respectively, and define the product distribution as in the proof of Lemma 5. Then by (24) the sequence $n(C_n - \log|\mathcal{S}|/n)$ is superadditive (26). Since $C_n$ is also bounded (say, by $\log |\mathcal{X}|$), we conclude that the normalized sequence converges to its supremum (see [5, p. 112]). Lemma 4 and the above corollary establish Proposition 1.

APPENDIX II

Proof of Lemma 2: This lemma is a slight modification of [5, Problem 5.23]. The reason we chose to include a proof is twofold: first, the solutions to the problems in [5] are not widely available, and second, the official solutions for this problem contain a small error. By expanding $E_0(\rho, Q)$ in a power series in $\rho$ around $\rho = 0$ to second order, we obtain, for some $0 \le \bar{\rho} \le \rho$,

$$E_0(\rho, Q) = E_0(0, Q) + \rho \left.\frac{\partial E_0}{\partial \rho}\right|_{\rho = 0} + \frac{\rho^2}{2} \left.\frac{\partial^2 E_0}{\partial \rho^2}\right|_{\rho = \bar{\rho}}.$$

We now differentiate to find the first derivative (27); since the underlying distributions remain normalized for all $\rho$, the second term in the resulting expression is zero at $\rho = 0$, confirming that $E_0(0, Q) = 0$ and that the first-order coefficient is the mutual information. It thus suffices to show that the second derivative is appropriately bounded for all $\rho$ in the range, and we now compute it by differentiating once more.

Substituting this into (27), we obtain an expression for the second derivative (28). We will upper-bound the right-hand side further by maximizing over the conditional distributions involved, subject to their normalization constraints. To that end, expand the square and maximize each term. Applying the elementary inequality $\ln t \le t - 1$ to the second term in brackets and combining the first two terms, the second maximization is immediate. For the first, write the sum so that the last term is nonpositive and can be dropped. Observe now that the remaining function of the conditional probabilities is concave in the relevant range; the next-to-last equality follows from this concavity, with the maximization taking place at the boundary of the range. Thus the second derivative is bounded in magnitude by a constant that depends only on the size of the output alphabet, as was to be shown.

APPENDIX III

In this Appendix we prove (22). For a given Gilbert–Elliott channel, define the coefficient governing the geometric contraction of the state estimates. Mushkin and Bar-David [6, Appendix] show that the quantity in (22) is upper-bounded by a geometrically decaying function of this coefficient, and we can thus prove (22) as long as the coefficient is bounded away from one by some fixed margin uniformly

for all channels in our class. We will show that under the assumptions of Theorem 2, namely the uniform bound (29) on the transition probabilities, the margin can be taken to be an explicit function of the bound in (13). To that end, suppose that the contraction coefficient of some channel in the class were too close to one. This implies that either the first of the defining quantities is extreme, in which case one of the transition probabilities violates (29), or else the second is, in which case the other transition probability violates (29). Take the first alternative: it implies (30) and (31), and from (30) and (31) we conclude (32). We can rewrite (32) as a lower bound on the transition probabilities, contradicting the supposition. Taking the second possibility yields the same lower bound. We thus conclude that the required uniform bound holds, which proves (22).

ACKNOWLEDGMENT

Stimulating discussions with R. Gallager, S. Mitter, R. Urbanke, and J. Ziv are gratefully acknowledged. The authors wish to thank O. Zeitouni for bringing [15] to their attention, and the anonymous referees for their helpful comments.

REFERENCES

[1] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[2] D. Blackwell, L. Breiman, and A. J. Thomasian, "The capacity of a class of channels," Ann. Math. Stat., vol. 30, pp. 1229–1241, Dec. 1959.
[3] J. Wolfowitz, Coding Theorems of Information Theory, 3rd ed. Berlin/Heidelberg: Springer-Verlag, 1978.
[4] W. L. Root and P. P. Varaiya, "Capacity of classes of Gaussian channels," SIAM J. Appl. Math., vol. 16, pp. 1350–1393, Nov. 1968.
[5] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[6] M. Mushkin and I. Bar-David, "Capacity and coding for the Gilbert–Elliott channel," IEEE Trans. Inform. Theory, vol. 35, pp. 1277–1290, Nov. 1989.
[7] A. J. Goldsmith and P. P. Varaiya, "Capacity, mutual information, and coding for finite-state Markov channels," IEEE Trans. Inform. Theory, vol. 42, pp. 868–886, May 1996.
[8] M. Feder and A. Lapidoth, "Universal decoding for noisy channels," Mass. Inst. Technol., Lab. Inform. Decision Syst., Tech. Rep. LIDS-P-2377, Dec. 1996; see also M. Feder and A. Lapidoth, "Universal decoding for channels with memory," IEEE Trans. Inform. Theory, to be published.
[9] T. Kaijser, "A limit theorem for partially observed Markov chains," Ann. Prob., vol. 3, no. 4, pp. 677–696, 1975.
[10] H. Furstenberg and H. Kesten, "Products of random matrices," Ann. Math. Statist., vol. 31, pp. 457–469, 1960.
[11] G. Bratt, "Sequential decoding for the Gilbert–Elliott channel: Strategy and analysis," Ph.D. dissertation, Lund University, Lund, Sweden, June 1994.
[12] R. L. Dobrushin, "Optimum information transmission through a channel with unknown parameters," Radiotekh. Elektron., vol. 4, no. 12, pp. 1951–1956, 1959.
[13] D. Blackwell, L. Breiman, and A. Thomasian, "The capacities of certain channel classes under random coding," Ann. Math. Statist., vol. 31, pp. 558–567, 1960.
[14] I. Csiszár and P. Narayan, "Capacity of the Gaussian arbitrarily varying channel," IEEE Trans. Inform. Theory, vol. 37, pp. 18–26, Jan. 1991.
[15] F. LeGland and L. Mevel, "Geometric ergodicity in hidden Markov models," Institut de Recherche en Informatique et Systèmes Aléatoires, IRISA pub. no. 1028, July 1996.