The mathematics behind wireless communication

June 2008

Questions and setting. In wireless communication, information is sent through what is called a channel. The channel is subject to noise, so that there will be some loss of information. How should we send information so that there is as little information loss as possible? How should we define the capacity of a channel? Can we find an expression for the capacity from the characteristics of the channel?

What is information? Assume that the random variable X takes values in the alphabet X = {α_1, α_2, ...}. Set p_i = Pr(X = α_i). How can we define a measure H of how much choice/uncertainty/information is associated with each outcome? Shannon [1] proposed the following requirements for H:
1. H should be continuous in the p_i.
2. If all the p_i are equal (p_i = 1/n), then H should be an increasing function of n (with equally likely events there is more uncertainty when there are more possible events).
3. If a choice can be broken down into successive choices, the original H should be the weighted sum of the individual values of H: a choice between {α_1, α_2, α_3} can first be split into a choice between α_1 and {α_2, α_3}, followed (in the latter case) by a choice between α_2 and α_3.

Entropy. Definition: The entropy of X is defined by

H(X) = H(p_1, p_2, ...) = -Σ_i p_i log_2(p_i).

The entropy is measured in bits. Shannon showed that an information measure which satisfies the requirements of the previous foil necessarily has this form! If p_1 = 1/2, p_2 = 1/3, p_3 = 1/6, the weighting described on the previous foil can be verified as

H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3),

where the weight 1/2 appearing on the right side is computed as p_2 + p_3 = 1/2.
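As a quick numerical sanity check of this definition and of the weighting identity above, here is a minimal Python/NumPy sketch (the language choice and the helper name `entropy` are mine, not from the talk):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits, H = -sum p_i log2 p_i (terms with p_i = 0 contribute 0)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# H(1/2, 1/3, 1/6) should equal H(1/2, 1/2) + (1/2) * H(2/3, 1/3)
lhs = entropy([1/2, 1/3, 1/6])
rhs = entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3])
print(lhs, rhs)   # both are about 1.459 bits
```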

Shannon's source coding theorem. We would like to represent data generated by the random variable X in a shorter way (i.e. compress it). Shannon's source coding theorem addresses the limits of such compression: Theorem: Assume that we have independent outcomes x_1, x_2, x_3, ... of the random variable X. The average number of bits per symbol for any lossless compression strategy is always greater than or equal to the entropy H(X). The entropy H is therefore a lower limit for achievable compression. The theoretical limit given by the entropy is also achievable. In a previous talk, I focused on methods for achieving the limit given by the entropy (Huffman coding, arithmetic coding).
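Huffman coding, mentioned above as one way to approach the entropy limit, can be sketched in a few lines. This is a minimal illustration, not the talk's material; the example distribution is dyadic, so the average code length meets the entropy exactly:

```python
import heapq
from math import log2

def huffman_code(probs):
    """Build a binary Huffman code for symbols 0..len(probs)-1; returns dict symbol -> codeword."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]   # (probability, tiebreak, symbols in subtree)
    heapq.heapify(heap)
    codes = {i: "" for i in range(len(probs))}
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)                 # merge the two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for s in s1: codes[s] = "0" + codes[s]
        for s in s2: codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return codes

probs = [0.5, 0.25, 0.125, 0.125]
codes = huffman_code(probs)
avg_len = sum(p * len(codes[i]) for i, p in enumerate(probs))
H = -sum(p * log2(p) for p in probs)
print(codes)        # e.g. {0: '0', 1: '10', 2: '110', 3: '111'}
print(avg_len, H)   # 1.75 bits per symbol in both cases
```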

Sketch of Shannon's proof. There exists a subset A_ε^(n) of all length-n sequences (x_1, x_2, ..., x_n) such that
- the size of A_ε^(n) is approximately 2^(nH(X)) (which can be small when compared to the number of all sequences),
- Pr(A_ε^(n)) > 1 - ε.
A_ε^(n) is called the typical set, and consists of all (x_1, x_2, ..., x_n) with empirical entropy -(1/n) log_2 p(x_1, x_2, ..., x_n) close enough to the actual entropy H(X). Shannon proved the source coding theorem by
1. assigning codes with a (smaller) fixed length to ALL elements in the typical set,
2. assigning codes with another (longer) fixed length to ALL elements outside the typical set,
3. letting n → ∞ and ε → 0.
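The heart of the argument is that the empirical entropy -(1/n) log_2 p(x_1, ..., x_n) concentrates around H(X) as n grows. A short simulation illustrating this (the source distribution and sample sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])                # source distribution, chosen for illustration
H = -np.sum(p * np.log2(p))                  # true entropy, about 1.157 bits

for n in (10, 100, 1000, 10000):
    x = rng.choice(len(p), size=n, p=p)      # one length-n source sequence
    emp = -np.mean(np.log2(p[x]))            # empirical entropy -(1/n) log2 p(x_1,...,x_n)
    print(n, round(emp, 3), "vs H =", round(H, 3))
```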

What is a communication channel? That A communicates with B means that the physical acts of A induce a desired physical state in B. This transfer of information is subject to noise and to the imperfections of the physical signaling process itself. The communication is successful if the receiver B and the transmitter A agree on what was sent.

Definition: A discrete channel, denoted by (X, p(y|x), Y), consists of two finite sets X (the input alphabet) and Y (the output alphabet), and a probability transition matrix p(y|x) that expresses the probability of observing the output symbol y given that we send the symbol x. The channel is said to be memoryless if the probability distribution of the output depends only on the input at that time, and is conditionally independent of previous channel inputs and outputs.
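As a concrete instance of this definition, a binary symmetric channel (a standard textbook example, used here purely for illustration) can be written as a 2x2 transition matrix and sampled from:

```python
import numpy as np

# Binary symmetric channel with crossover probability eps, written as a
# transition matrix P[x, y] = p(y | x) over input/output alphabets {0, 1}.
eps = 0.1
P = np.array([[1 - eps, eps],
              [eps, 1 - eps]])

rng = np.random.default_rng(1)

def send(x):
    """Pass one input symbol through the channel by sampling from p(. | x)."""
    return int(rng.choice(2, p=P[x]))

print([send(0) for _ in range(10)])   # mostly 0s, each flipped with probability eps
```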

A general scheme for communication: W → Encoder → X^n → Channel p(y|x) → Y^n → Decoder → Ŵ
- W ∈ {1, 2, ..., M} is the message we seek to transfer via the channel.
- The encoder is a map X^n : {1, 2, ..., M} → X^n (length-n input sequences), taking values in a codebook (X^n(1), X^n(2), ..., X^n(M)) of size M.
- The decoder is a map from Y^n to {1, 2, ..., M}. This is a deterministic rule that assigns a guess to each possible received vector.
- Ŵ ∈ {1, 2, ..., M} is the message retrieved by the decoder.
- n is the block length. It says how many times the channel is used for each transmission.
- M is the number of possible messages. A message can thus be represented with log_2(M) bits.

The encoder/decoder pair is called an (M, n)-code (i.e. a code with M possible messages and n uses of the channel per transmission). When the encoder maps a message to a codeword in the data transmission process, it adds redundancy in a controlled fashion to combat errors in the channel. This is in contrast to data compression, where one goes the opposite way, i.e. removes redundancy in the data to obtain the most compressed form possible. The basic question is: how can one construct an encoder/decoder pair such that there is a high probability that the received message Ŵ equals the transmitted message W?
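The simplest illustration of adding redundancy is a repetition code: repeat each message bit n times and decode by majority vote. A rough sketch over the binary symmetric channel from before (the parameters eps and n are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
eps, n = 0.1, 5                          # crossover probability and block length (assumptions)

def encode(bit):
    """Rate-1/n repetition code: add redundancy by repeating the message bit n times."""
    return np.full(n, bit)

def decode(y):
    """Majority vote over the received block."""
    return int(y.sum() > n / 2)

bits = rng.integers(0, 2, size=10000)
received = [(encode(b) + (rng.random(n) < eps)) % 2 for b in bits]   # BSC flips each position
decoded = np.array([decode(y) for y in received])
print("bit error rate:", np.mean(decoded != bits))   # much smaller than eps, at rate R = 1/n
```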

Definition: Let λ_W be the probability that the received message Ŵ is different from the sent message W. This is called the conditional probability of error given that W was sent. We also define the maximal probability of error as λ^(n) = max_{W ∈ {1, 2, ..., M}} λ_W.

Definition: The rate of an (M, n)-code is defined as R = log_2(M)/n, measured in bits per transmission.

Definition: A rate R is said to be achievable if for each n there exists a (2^(nR), n)-code such that lim_{n→∞} λ^(n) = 0 (i.e. the maximal probability of error goes to 0).

Definition: The (operational) capacity of a channel is the supremum of all achievable rates.

Shannon's channel coding theorem. Expresses the capacity in terms of the probability distribution of the channel, irrespective of the use of encoders/decoders. Theorem: The capacity of a discrete memoryless channel is given by

C = max_{q(x)} I(X; Y),

where X/Y is the random input/output to the channel, with X having distribution q(x) on X. Here I(X; Y) is the mutual information between the random variables X and Y, defined by

I(X; Y) = Σ_{x,y} p(x, y) log_2( p(x, y) / (p(x) p(y)) ),    (1)

where p(x, y) is the joint p.d.f. of X and Y.
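For the binary symmetric channel used earlier, the maximization over input distributions can be done numerically and compared with the known closed form 1 - H(eps). A brute-force sketch (for illustration only; the grid search is mine):

```python
import numpy as np

def mutual_information(q, P):
    """I(X;Y) in bits for input distribution q and channel matrix P[x, y] = p(y|x)."""
    joint = q[:, None] * P                      # p(x, y)
    py = joint.sum(axis=0)                      # p(y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (q[:, None] * py[None, :])[mask])))

eps = 0.1
P = np.array([[1 - eps, eps], [eps, 1 - eps]])  # binary symmetric channel (illustration)

# Brute-force search over input distributions q = (a, 1 - a).
grid = np.linspace(0.001, 0.999, 999)
caps = [mutual_information(np.array([a, 1 - a]), P) for a in grid]
print("max I(X;Y) ~", round(max(caps), 4))      # about 0.531, attained at the uniform input
print("1 - H(eps)  =", round(1 + eps*np.log2(eps) + (1-eps)*np.log2(1-eps), 4))
```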

Sketch of proof I. We generalize the definition of the typical set (from the proof of the source coding theorem) to the following: the jointly typical set consists of all jointly typical sequences (x^n, y^n) = ((x_1, x_2, ..., x_n), (y_1, y_2, ..., y_n)), defined as those sequences where
1. the empirical entropy of (x_1, x_2, ..., x_n) is close enough to the actual entropy H(X),
2. the empirical entropy of (y_1, y_2, ..., y_n) is close enough to the actual entropy H(Y),
3. the joint empirical entropy -(1/n) log_2( Π_{i=1}^n p(x_i, y_i) ) of ((x_1, ..., x_n), (y_1, ..., y_n)) is close enough to the actual joint entropy H(X, Y), defined by

H(X, Y) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log_2 p(x, y),

where p(x, y) is the joint distribution of X and Y.

Sketch of proof II. The jointly typical set is, just as the typical set, denoted A_ε^(n). It has the following properties, similar to the corresponding properties of the typical set:
1. The size of A_ε^(n) is approximately 2^(nH(X,Y)) (which is small when compared to the number of all pairs of sequences).
2. Pr(A_ε^(n)) → 1 as n → ∞.

Sketch of proof III. The channel coding theorem can be proved in the following way for a given rate R < C:
1. Construct a randomly generated codebook with 2^(nR) codewords from X^n (the codewords are drawn according to some fixed distribution of the input). Define the encoder as any mapping from {1, ..., 2^(nR)} into this set.
2. Define the decoder in the following way: if the output (y_1, y_2, ..., y_n) of the channel is jointly typical with a unique codeword (x_1, ..., x_n), define (x_1, ..., x_n) as the output of the decoder. Otherwise, the output of the decoder should be some dummy index, declaring an error.
3. One can show that, with high probability (going to 1 as n → ∞), the input to the channel (x_1, x_2, ..., x_n) is jointly typical with the output (y_1, y_2, ..., y_n). The expression for the mutual information enters the picture when computing the probability that the output is jointly typical with another codeword, which is approximately 2^(-nI(X;Y)).
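The random-coding argument can be mimicked numerically for the binary symmetric channel: draw a random codebook at a rate R < C, transmit, and decode. Minimum Hamming distance decoding is used below as a computable stand-in for joint-typicality decoding, and all parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.1                                            # BSC crossover probability (illustration)
C = 1 + eps*np.log2(eps) + (1-eps)*np.log2(1-eps)    # capacity 1 - H(eps) of this channel
R = 0.2                                              # a rate below capacity
print(f"R = {R}, C = {C:.3f}")

for n in (10, 30, 60):
    M = 2 ** int(n * R)                              # number of messages, roughly 2^(nR)
    codebook = rng.integers(0, 2, size=(M, n), dtype=np.int8)   # step 1: random codebook
    errors, trials = 0, 500
    for _ in range(trials):
        w = rng.integers(M)                                      # message to send
        y = (codebook[w] + (rng.random(n) < eps)) % 2            # received word after the BSC
        # Step 2 of the proof uses joint-typicality decoding; minimum Hamming
        # distance (maximum likelihood for this channel) is a computable stand-in.
        w_hat = int(np.argmin(np.sum(codebook != y, axis=1)))
        errors += (w_hat != w)
    print(f"n = {n:3d}, M = {M:5d}: estimated error rate {errors/trials:.3f}")
# For R < C the error rate should fall towards 0 as n grows (step 3 of the proof).
```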

More general channels I. In general, channels do not use finite alphabet inputs/outputs. The most important continuous alphabet channel is the Gaussian channel. This is a time-discrete channel with output Y_i at time i given by Y_i = X_i + Z_i, where X_i is the input and Z_i ~ N(0, N) is Gaussian noise with variance N. Capacity can be defined in a similar fashion for such channels. The capacity can be infinite unless we restrict the input. The most common such restriction is a limitation on its variance. Assume that the variance of the input is less than P. One can then show that the capacity of the Gaussian channel is

C = (1/2) log_2(1 + P/N),

and that the capacity is achieved when X ~ N(0, P).
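A short evaluation of this capacity formula for a few signal-to-noise ratios (the SNR values are arbitrary choices):

```python
import numpy as np

def awgn_capacity(P, N):
    """Capacity (1/2) log2(1 + P/N), in bits per channel use, of the Gaussian channel
    with noise variance N and input power constraint P."""
    return 0.5 * np.log2(1 + P / N)

for snr in (1, 10, 100):    # P/N values chosen for illustration
    print(f"P/N = {snr:4d}:  C = {awgn_capacity(snr, 1):.3f} bits per channel use")
```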

More general channels II. In general, communication systems consist of multiple transmitters and receivers, talking to and interfering with each other. Such communication systems are described by a channel matrix whose dimensions match the number of transmitters and receivers; its entries are functions of the geometry of the transmitting and receiving antennas. Capacity can be described in a meaningful way for such systems also. It turns out that, for a wide class of channels, the capacity is given by

C = (1/n) log_2 det( I_n + (ρ/m) H H^H ),

where H is the n × m channel matrix, n and m are the numbers of receiving and transmitting antennas, and ρ = P/N is the signal-to-noise ratio (as for the Gaussian channel).
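The formula can be evaluated directly once a channel matrix is given. In the sketch below an i.i.d. complex Gaussian H stands in for the geometry-dependent channel matrix; this modelling choice, the antenna counts, and the SNR are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n_rx, n_tx, rho = 4, 4, 10.0    # receive/transmit antenna counts and SNR (assumptions)

# Random i.i.d. complex Gaussian entries stand in for the geometry-dependent H (n_rx x n_tx).
H = (rng.standard_normal((n_rx, n_tx)) + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)

# C = (1/n) log2 det( I_n + (rho/m) H H^H ), as in the formula above.
C = np.log2(np.linalg.det(np.eye(n_rx) + (rho / n_tx) * H @ H.conj().T).real) / n_rx
print(f"{C:.3f} bits per channel use per receive antenna")
```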

Active areas of research and open problems.
- How do we construct codebooks which help us achieve rates close to the capacity? In other words, how can we find the input distribution p(x) which maximizes I(X; Y) (the mutual information between the input and the output)? Such codes should also be implementable. Much progress has been made in recent years: convolutional codes, Turbo codes, LDPC (Low-Density Parity Check) codes.
- Error-correcting codes: these codes are able to detect where bit errors have occurred in the received data, and hence to correct them. Hamming codes are a classical example (a small sketch follows below).
- What is the capacity of more general systems? One has to account for any number of receivers/transmitters, any type of interference, and cooperation and feedback between the sending and receiving antennas. The general case is far from being solved.
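As an example of the Hamming codes mentioned above, here is a minimal Hamming(7,4) encoder/decoder sketch; the systematic generator and parity-check matrices are one standard choice, picked here for illustration:

```python
import numpy as np

# Hamming(7,4): encodes 4 data bits into 7 and corrects any single bit error.
G = np.array([[1,0,0,0,1,1,0],     # generator matrix, systematic form [I_4 | P]
              [0,1,0,0,1,0,1],
              [0,0,1,0,0,1,1],
              [0,0,0,1,1,1,1]])
H = np.array([[1,1,0,1,1,0,0],     # parity-check matrix [P^T | I_3], so G H^T = 0 (mod 2)
              [1,0,1,1,0,1,0],
              [0,1,1,1,0,0,1]])

def encode(data):
    return data @ G % 2

def decode(received):
    syndrome = H @ received % 2
    if syndrome.any():                                       # nonzero syndrome: locate and flip the bad bit
        error_pos = int(np.argmax((H.T == syndrome).all(axis=1)))
        received = received.copy()
        received[error_pos] ^= 1
    return received[:4]                                      # data bits sit in the first 4 positions

data = np.array([1, 0, 1, 1])
codeword = encode(data)
codeword[2] ^= 1                                             # flip one bit in the channel
print(decode(codeword), "recovered from", data)
```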

Good sources on information theory are the books [2] (which most of these foils are based on) and [3]. Related courses at UNIK: UNIK4190, UNIK4220, UNIK4230. Related courses at NTNU: TTT4125, TTT4110. This talk is available at http://heim.ifi.uio.no/~oyvindry/talks.shtml. My publications are listed at http://heim.ifi.uio.no/~oyvindry/publications.shtml

[1] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, October 1948.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley, 2006.
[3] D. J. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.