Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels


Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels. Erdal Arıkan, Senior Member, IEEE. arXiv:0807.3917v5 [cs.IT] 20 Jul 2009.

Abstract—A method is proposed, called channel polarization, to construct code sequences that achieve the symmetric capacity I(W) of any given binary-input discrete memoryless channel (B-DMC) W. The symmetric capacity is the highest rate achievable subject to using the input letters of the channel with equal probability. Channel polarization refers to the fact that it is possible to synthesize, out of N independent copies of a given B-DMC W, a second set of N binary-input channels {W_N^(i) : 1 ≤ i ≤ N} such that, as N becomes large, the fraction of indices i for which I(W_N^(i)) is near 1 approaches I(W) and the fraction for which I(W_N^(i)) is near 0 approaches 1 − I(W). The polarized channels {W_N^(i)} are well-conditioned for channel coding: one need only send data at rate 1 through those with capacity near 1 and at rate 0 through the remaining. Codes constructed on the basis of this idea are called polar codes. The paper proves that, given any B-DMC W with I(W) > 0 and any target rate R < I(W), there exists a sequence of polar codes {C_n; n ≥ 1} such that C_n has block-length N = 2^n, rate ≥ R, and probability of block error under successive cancellation decoding bounded as P_e(N, R) ≤ O(N^{−1/4}) independently of the code rate. This performance is achievable by encoders and decoders with complexity O(N log N) for each.

Index Terms—Capacity-achieving codes, channel capacity, channel polarization, Plotkin construction, polar codes, Reed-Muller codes, successive cancellation decoding.

I. INTRODUCTION AND OVERVIEW

A fascinating aspect of Shannon's proof of the noisy channel coding theorem is the random-coding method that he used to show the existence of capacity-achieving code sequences without exhibiting any specific such sequence [1]. Explicit construction of provably capacity-achieving code sequences with low encoding and decoding complexities has since
then been an elusive goal. This paper is an attempt to meet this goal for the class of B-DMCs. We will give a description of the main ideas and results of the paper in this section. First, we give some definitions and state some basic facts that are used throughout the paper.

A. Preliminaries

We write W : X → Y to denote a generic B-DMC with input alphabet X, output alphabet Y, and transition probabilities W(y|x), x ∈ X, y ∈ Y. The input alphabet X will always be {0, 1}; the output alphabet and the transition probabilities may be arbitrary. We write W^N to denote the channel corresponding to N uses of W; thus, W^N : X^N → Y^N with W^N(y_1^N | x_1^N) = ∏_{i=1}^N W(y_i | x_i).

[Footnote: E. Arıkan is with the Department of Electrical-Electronics Engineering, Bilkent University, Ankara, 06800, Turkey (e-mail: arikan@ee.bilkent.edu.tr). This work was supported in part by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under Project 107E216 and in part by the European Commission FP7 Network of Excellence NEWCOM under contract 216715.]

Given a B-DMC W, there are two channel parameters of primary interest in this paper: the symmetric capacity

I(W) ≜ Σ_{y ∈ Y} Σ_{x ∈ X} (1/2) W(y|x) log [ W(y|x) / ( (1/2) W(y|0) + (1/2) W(y|1) ) ]

and the Bhattacharyya parameter

Z(W) ≜ Σ_{y ∈ Y} sqrt( W(y|0) W(y|1) ).

These parameters are used as measures of rate and reliability, respectively. I(W) is the highest rate at which reliable communication is possible across W using the inputs of W with equal frequency. Z(W) is an upper bound on the probability of maximum-likelihood (ML) decision error when W is used only once to transmit a 0 or 1.

It is easy to see that Z(W) takes values in [0, 1]. Throughout, we will use base-2 logarithms; hence, I(W) will also take values in [0, 1]. The unit for code rates and channel capacities will be bits. Intuitively, one would expect that I(W) ≈ 1 iff Z(W) ≈ 0, and I(W) ≈ 0 iff Z(W) ≈ 1. The following bounds, proved in the Appendix, make this precise.

Proposition 1: For any B-DMC W, we have

I(W) ≥ log [ 2 / (1 + Z(W)) ],   (1)
I(W) ≤ sqrt( 1 − Z(W)^2 ).   (2)

The symmetric capacity I(W) equals the Shannon capacity when W is a symmetric channel, i.e., a channel for which there exists a
permutation π of the output alphabet Y such that (i) π^{−1} = π and (ii) W(y|1) = W(π(y)|0) for all y ∈ Y. The binary symmetric channel (BSC) and the binary erasure channel (BEC) are examples of symmetric channels. A BSC is a B-DMC W with Y = {0, 1}, W(0|0) = W(1|1), and W(1|0) = W(0|1). A B-DMC W is called a BEC if for each y ∈ Y, either W(y|0) W(y|1) = 0 or W(y|0) = W(y|1). In the latter case, y is said to be an erasure symbol. The sum of W(y|0) over all erasure symbols y is called the erasure probability of the BEC.
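As a concrete illustration of these definitions, the following minimal Python sketch (not part of the paper; the channel is stored as a nested dict of transition probabilities, and the helper names are our own) evaluates I(W) and Z(W) for a BSC and checks the bounds (1)-(2) of Proposition 1:

```python
import math

def symmetric_capacity(W):
    """I(W) = sum_y sum_x (1/2) W(y|x) log2( W(y|x) / ((1/2)W(y|0) + (1/2)W(y|1)) ).
    W is a dict with W[x][y] = transition probability, input x in {0, 1}."""
    total = 0.0
    for y in W[0]:
        q = 0.5 * W[0][y] + 0.5 * W[1][y]  # output distribution under uniform input
        for x in (0, 1):
            p = W[x][y]
            if p > 0:
                total += 0.5 * p * math.log2(p / q)
    return total

def bhattacharyya(W):
    """Z(W) = sum_y sqrt( W(y|0) W(y|1) )."""
    return sum(math.sqrt(W[0][y] * W[1][y]) for y in W[0])

# BSC with crossover probability 0.11, for which I(W) = 1 - H(0.11) ~ 0.5
p = 0.11
bsc = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
I, Z = symmetric_capacity(bsc), bhattacharyya(bsc)

# Proposition 1: log2(2/(1+Z)) <= I(W) <= sqrt(1 - Z^2)
assert math.log2(2 / (1 + Z)) <= I <= math.sqrt(1 - Z * Z)
```

For the BSC the formula reduces to I(W) = 1 − H(p), with H the binary entropy function, which gives a quick sanity check on the implementation.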

We denote random variables (RVs) by upper-case letters, such as X, Y, and their realizations (sample values) by the corresponding lower-case letters, such as x, y. For X a RV, P_X denotes the probability assignment on X. For a joint ensemble of RVs (X, Y), P_{X,Y} denotes the joint probability assignment. We use the standard notation I(X; Y), I(X; Y|Z) to denote the mutual information and its conditional form, respectively.

We use the notation a_1^N as shorthand for denoting a row vector (a_1, …, a_N). Given such a vector a_1^N, we write a_i^j, 1 ≤ i, j ≤ N, to denote the subvector (a_i, …, a_j); if j < i, a_i^j is regarded as void. Given a_1^N and A ⊂ {1, …, N}, we write a_A to denote the subvector (a_i : i ∈ A). We write a_{1,o}^j to denote the subvector with odd indices (a_k : 1 ≤ k ≤ j; k odd). We write a_{1,e}^j to denote the subvector with even indices (a_k : 1 ≤ k ≤ j; k even). For example, for a_1^5 = (5, 4, 6, 2, 1), we have a_2^4 = (4, 6, 2), a_{1,e}^5 = (4, 2), a_{1,o}^4 = (5, 6). The notation 0_1^N is used to denote the all-zero vector.

Code constructions in this paper will be carried out in vector spaces over the binary field GF(2). Unless specified otherwise, all vectors, matrices, and operations on them will be over GF(2). In particular, for a_1^N, b_1^N vectors over GF(2), we write a_1^N ⊕ b_1^N to denote their componentwise mod-2 sum. The Kronecker product of an m-by-n matrix A = [A_{ij}] and an r-by-s matrix B = [B_{ij}] is defined as

A ⊗ B = [ A_{11} B … A_{1n} B ; ⋮ ; A_{m1} B … A_{mn} B ],

which is an mr-by-ns matrix. The Kronecker power A^{⊗n} is defined as A ⊗ A^{⊗(n−1)} for all n ≥ 1. We will follow the convention that A^{⊗0} = [1].

We write |A| to denote the number of elements in a set A. We write 1_A to denote the indicator function of a set A; thus, 1_A(x) equals 1 if x ∈ A and 0 otherwise. We use the standard Landau notation O(N), o(N), ω(N) to denote the asymptotic behavior of functions.

B. Channel polarization

Channel polarization is an operation by which one manufactures out of N independent copies of a given B-DMC W a second set of N channels {W_N^(i) : 1 ≤ i ≤ N} that show a polarization effect in the sense that, as N becomes large, the symmetric capacity
terms {I(W_N^(i))} tend towards 0 or 1 for all but a vanishing fraction of indices i. This operation consists of a channel combining phase and a channel splitting phase.

1) Channel combining: This phase combines copies of a given B-DMC W in a recursive manner to produce a vector channel W_N : X^N → Y^N, where N can be any power of two, N = 2^n, n ≥ 0. The recursion begins at the 0-th level (n = 0) with only one copy of W and we set W_1 ≜ W. The first level (n = 1) of the recursion combines two independent copies of W_1 as shown in Fig. 1 and obtains the channel W_2 : X^2 → Y^2 with the transition probabilities

W_2(y_1, y_2 | u_1, u_2) = W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2).   (3)

Fig. 1. The channel W_2. [Figure: inputs u_1, u_2 mapped to x_1 = u_1 ⊕ u_2, x_2 = u_2; outputs y_1, y_2.]

The next level of the recursion is shown in Fig. 2 where two independent copies of W_2 are combined to create the channel W_4 : X^4 → Y^4 with transition probabilities

W_4(y_1^4 | u_1^4) = W_2(y_1^2 | u_1 ⊕ u_2, u_3 ⊕ u_4) W_2(y_3^4 | u_2, u_4).

Fig. 2. The channel W_4 and its relation to W_2 and W. [Figure: inputs u_1^4, intermediate signals s_1^4, v_1^4, raw-channel inputs x_1^4, outputs y_1^4.]

In Fig. 2, R_4 is the permutation operation that maps an input (s_1, s_2, s_3, s_4) to v_1^4 = (s_1, s_3, s_2, s_4). The mapping u_1^4 ↦ x_1^4 from the input of W_4 to the input of W^4 can be written as x_1^4 = u_1^4 G_4 with

G_4 = [1 0 0 0; 1 0 1 0; 1 1 0 0; 1 1 1 1].

Thus, we have the relation W_4(y_1^4 | u_1^4) = W^4(y_1^4 | u_1^4 G_4) between the transition probabilities of W_4 and those of W^4.

The general form of the recursion is shown in Fig. 3 where two independent copies of W_{N/2} are combined to produce the channel W_N. The input vector u_1^N to W_N is first transformed into s_1^N so that s_{2i−1} = u_{2i−1} ⊕ u_{2i} and s_{2i} = u_{2i} for 1 ≤ i ≤ N/2. The operator R_N in the figure is a permutation, known as the reverse shuffle operation, and acts on its input s_1^N to produce v_1^N = (s_1, s_3, …, s_{N−1}, s_2, s_4, …, s_N), which becomes the input to the two copies of W_{N/2} as shown in the figure.

We observe that the mapping u_1^N ↦ v_1^N is linear over GF(2). It follows by induction that the overall mapping u_1^N ↦ x_1^N, from the input of the synthesized channel W_N to the input of the underlying raw channels W^N, is also linear and may be represented by a matrix G_N so that x_1^N = u_1^N G_N. We call G_N

the generator matrix of size N. The transition probabilities of the two channels W_N and W^N are related by

W_N(y_1^N | u_1^N) = W^N(y_1^N | u_1^N G_N)   (4)

for all y_1^N ∈ Y^N, u_1^N ∈ X^N. We will show in Sect. VII that G_N equals B_N F^{⊗n} for any N = 2^n, n ≥ 0, where B_N is a permutation matrix known as bit-reversal and F ≜ [1 0; 1 1]. Note that the channel combining operation is fully specified by the matrix F. Also note that G_N and F^{⊗n} have the same set of rows, but in a different (bit-reversed) order; we will discuss this topic more fully in Sect. VII.

Fig. 3. Recursive construction of W_N from two copies of W_{N/2}. [Figure: inputs u_1^N transformed to s_1^N, reverse-shuffled by R_N into v_1^N, fed to two copies of W_{N/2} with outputs y_1^N.]

2) Channel splitting: Having synthesized the vector channel W_N out of W^N, the next step of channel polarization is to split W_N back into a set of N binary-input coordinate channels W_N^(i) : X → Y^N × X^{i−1}, 1 ≤ i ≤ N, defined by the transition probabilities

W_N^(i)(y_1^N, u_1^{i−1} | u_i) ≜ Σ_{u_{i+1}^N ∈ X^{N−i}} (1/2^{N−1}) W_N(y_1^N | u_1^N),   (5)

where (y_1^N, u_1^{i−1}) denotes the output of W_N^(i) and u_i its input. To gain an intuitive understanding of the channels {W_N^(i)}, consider a genie-aided successive cancellation decoder in which the ith decision element estimates u_i after observing y_1^N and the past channel inputs u_1^{i−1} (supplied correctly by the genie regardless of any decision errors at earlier stages). If u_1^N is a-priori uniform on X^N, then W_N^(i) is the effective channel seen by the ith decision element in this scenario.

3) Channel polarization:

Theorem 1: For any B-DMC W, the channels {W_N^(i)} polarize in the sense that, for any fixed δ ∈ (0, 1), as N goes to infinity through powers of two, the fraction of indices i ∈ {1, …, N} for which I(W_N^(i)) ∈ (1 − δ, 1] goes to I(W) and the fraction for which I(W_N^(i)) ∈ [0, δ) goes to 1 − I(W). This theorem is proved in Sect. IV.

Fig. 4. Plot of I(W_N^(i)) vs. channel index i = 1, …, 2^10 for a BEC with ǫ = 0.5. [Figure: symmetric capacity (vertical axis) against channel index (horizontal axis).]

The polarization effect is illustrated in Fig. 4 for the case W is a BEC with erasure probability ǫ = 0.5. The numbers {I(W_N^(i))} have been computed using the recursive relations I(W_N^(2i−1)) = I(W_{N/2}^(i))^2, I(W_N^(2i)) = 2 I(W_{N/2}^(i)) −
I(W_{N/2}^(i))^2,   (6) with I(W_1^(1)) = 1 − ǫ. This recursion is valid only for BECs and it is proved in Sect. III. No efficient algorithm is known for calculation of {I(W_N^(i))} for a general B-DMC.

Figure 4 shows that I(W_N^(i)) tends to be near 0 for small i and near 1 for large i. However, I(W_N^(i)) shows an erratic behavior for an intermediate range of i. For general B-DMCs, determining the subset of indices i for which I(W_N^(i)) is above a given threshold is an important computational problem that will be addressed in Sect. IX.

4) Rate of polarization: For proving coding theorems, the speed with which the polarization effect takes hold as a function of N is important. Our main result in this regard is given in terms of the parameters

Z(W_N^(i)) = Σ_{y_1^N ∈ Y^N} Σ_{u_1^{i−1} ∈ X^{i−1}} sqrt( W_N^(i)(y_1^N, u_1^{i−1} | 0) W_N^(i)(y_1^N, u_1^{i−1} | 1) ).   (7)

Theorem 2: For any B-DMC W with I(W) > 0, and any fixed R < I(W), there exists a sequence of sets A_N ⊂ {1, …, N}, N ∈ {1, 2, …, 2^n, …}, such that |A_N| ≥ NR and Z(W_N^(i)) ≤ O(N^{−5/4}) for all i ∈ A_N. This theorem is proved in Sect. IV-B.
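The BEC capacity recursion above is straightforward to implement; the following Python sketch (our own helper name, not from the paper) reproduces the numbers behind Fig. 4 for a BEC with ǫ = 0.5:

```python
def bec_capacities(eps, n):
    """I(W_N^(i)) for i = 1..N = 2^n over a BEC(eps), via the recursion
    (valid only for BECs): I(W_N^(2i-1)) = I^2, I(W_N^(2i)) = 2I - I^2,
    starting from I(W_1^(1)) = 1 - eps."""
    caps = [1.0 - eps]
    for _ in range(n):
        nxt = []
        for c in caps:
            nxt.append(c * c)          # odd index 2i-1: the degraded channel
            nxt.append(2 * c - c * c)  # even index 2i: the upgraded channel
        caps = nxt
    return caps

# Data behind Fig. 4: BEC(0.5), N = 2^10 = 1024 synthesized channels
caps = bec_capacities(0.5, 10)

# The transform preserves total rate: sum_i I(W_N^(i)) = N * I(W)
assert abs(sum(caps) - len(caps) * 0.5) < 1e-9
```

Plotting `caps` against the channel index reproduces the staircase shape described in the text: values near 0 for small i, near 1 for large i, and erratic behavior in between.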

We stated the polarization result in Theorem 2 in terms of {Z(W_N^(i))} rather than {I(W_N^(i))} because this form is better suited to the coding results that we will develop. A rate of polarization result in terms of {I(W_N^(i))} can be obtained from Theorem 2 with the help of Prop. 1.

C. Polar coding

We take advantage of the polarization effect to construct codes that achieve the symmetric channel capacity I(W) by a method we call polar coding. The basic idea of polar coding is to create a coding system where one can access each coordinate channel W_N^(i) individually and send data only through those for which Z(W_N^(i)) is near 0.

1) G_N-coset codes: We first describe a class of block codes that contain polar codes — the codes of main interest — as a special case. The block-lengths N for this class are restricted to powers of two, N = 2^n for some n ≥ 0. For a given N, each code in the class is encoded in the same manner, namely,

x_1^N = u_1^N G_N   (8)

where G_N is the generator matrix of order N, defined above. For A an arbitrary subset of {1, …, N}, we may write (8) as

x_1^N = u_A G_N(A) ⊕ u_{A^c} G_N(A^c)   (9)

where G_N(A) denotes the submatrix of G_N formed by the rows with indices in A. If we now fix A and u_{A^c}, but leave u_A as a free variable, we obtain a mapping from source blocks u_A to codeword blocks x_1^N. This mapping is a coset code: it is a coset of the linear block code with generator matrix G_N(A), with the coset determined by the fixed vector u_{A^c} G_N(A^c). We will refer to this class of codes collectively as G_N-coset codes. Individual G_N-coset codes will be identified by a parameter vector (N, K, A, u_{A^c}), where K is the code dimension and specifies the size of A. The ratio K/N is called the code rate. We will refer to A as the information set and to u_{A^c} ∈ X^{N−K} as frozen bits or vector.

For example, the (4, 2, {2, 4}, (1, 0)) code has the encoder mapping

x_1^4 = u_1^4 G_4 = (u_2, u_4) [1 0 1 0; 1 1 1 1] ⊕ (1, 0) [1 0 0 0; 1 1 0 0].   (10)

For a source block (u_2, u_4) = (1, 1), the coded block is x_1^4 = (1, 1, 0, 1). Polar codes will be specified shortly by giving a particular rule for the selection of the information
set A.

2) A successive cancellation decoder: Consider a G_N-coset code with parameter (N, K, A, u_{A^c}). Let u_1^N be encoded into a codeword x_1^N, let x_1^N be sent over the channel W^N, and let a channel output y_1^N be received. The decoder's task is to generate an estimate û_1^N of u_1^N, given knowledge of A, u_{A^c}, and y_1^N. Since the decoder can avoid errors in the frozen part by setting û_{A^c} = u_{A^c}, the real decoding task is to generate an estimate û_A of u_A. [Footnote: We include the redundant parameter K in the parameter set because often we consider an ensemble of codes with K fixed and A free.]

The coding results in this paper will be given with respect to a specific successive cancellation (SC) decoder, unless some other decoder is mentioned. Given any (N, K, A, u_{A^c}) G_N-coset code, we will use a SC decoder that generates its decision û_1^N by computing

û_i ≜ { u_i, if i ∈ A^c; h_i(y_1^N, û_1^{i−1}), if i ∈ A }   (11)

in the order i from 1 to N, where h_i : Y^N × X^{i−1} → X, i ∈ A, are decision functions defined as

h_i(y_1^N, û_1^{i−1}) ≜ { 0, if W_N^(i)(y_1^N, û_1^{i−1} | 0) / W_N^(i)(y_1^N, û_1^{i−1} | 1) ≥ 1; 1, otherwise }   (12)

for all y_1^N ∈ Y^N, û_1^{i−1} ∈ X^{i−1}. We will say that a decoder block error occurred if û_1^N ≠ u_1^N or, equivalently, if û_A ≠ u_A.

The decision functions {h_i} defined above resemble ML decision functions but are not exactly so, because they treat the future frozen bits (u_j : j > i, j ∈ A^c) as RVs, rather than as known bits. In exchange for this suboptimality, {h_i} can be computed efficiently using recursive formulas, as we will show in Sect. II. Apart from algorithmic efficiency, the recursive structure of the decision functions is important because it renders the performance analysis of the decoder tractable. Fortunately, the loss in performance due to not using true ML decision functions happens to be negligible: I(W) is still achievable.

3) Code performance: The notation P_e(N, K, A, u_{A^c}) will denote the probability of block error for a (N, K, A, u_{A^c}) code, assuming that each data vector u_A ∈ X^K is sent with probability 2^{−K} and decoding is done by the above SC decoder. More precisely,

P_e(N, K, A, u_{A^c}) =
Σ_{u_A ∈ X^K} 2^{−K} Σ_{y_1^N ∈ Y^N : û_1^N(y_1^N) ≠ u_1^N} W_N(y_1^N | u_1^N).

The average of P_e(N, K, A, u_{A^c}) over all choices for u_{A^c} will be denoted by P_e(N, K, A):

P_e(N, K, A) ≜ Σ_{u_{A^c} ∈ X^{N−K}} 2^{−(N−K)} P_e(N, K, A, u_{A^c}).

A key bound on block error probability under SC decoding is the following.

Proposition 2: For any B-DMC W and any choice of the parameters (N, K, A),

P_e(N, K, A) ≤ Σ_{i ∈ A} Z(W_N^(i)).   (13)

Hence, for each (N, K, A), there exists a frozen vector u_{A^c} such that

P_e(N, K, A, u_{A^c}) ≤ Σ_{i ∈ A} Z(W_N^(i)).   (14)

This is proved in Sect. V-B. This result suggests choosing A from among all K-subsets of {1, …, N} so as to minimize

the RHS of (13). This idea leads to the definition of polar codes.

4) Polar codes: Given a B-DMC W, a G_N-coset code with parameter (N, K, A, u_{A^c}) will be called a polar code for W if the information set A is chosen as a K-element subset of {1, …, N} such that Z(W_N^(i)) ≤ Z(W_N^(j)) for all i ∈ A, j ∈ A^c.

Polar codes are channel-specific designs: a polar code for one channel may not be a polar code for another. The main result of this paper will be to show that polar coding achieves the symmetric capacity I(W) of any given B-DMC.

An alternative rule for polar code definition would be to specify A as a K-element subset of {1, …, N} such that I(W_N^(i)) ≥ I(W_N^(j)) for all i ∈ A, j ∈ A^c. This alternative rule would also achieve I(W). However, the rule based on the Bhattacharyya parameters has the advantage of being connected with an explicit bound on block error probability.

The polar code definition does not specify how the frozen vector u_{A^c} is to be chosen; it may be chosen at will. This degree of freedom in the choice of u_{A^c} simplifies the performance analysis of polar codes by allowing averaging over an ensemble. However, it is not for analytical convenience alone that we do not specify a precise rule for selecting u_{A^c}, but also because it appears that the code performance is relatively insensitive to that choice. In fact, we prove in Sect. VI-B that, for symmetric channels, any choice for u_{A^c} is as good as any other.

5) Coding theorems: Fix a B-DMC W and a number R ≥ 0. Let P_e(N, R) be defined as P_e(N, ⌊NR⌋, A) with A selected in accordance with the polar coding rule for W. Thus, P_e(N, R) is the probability of block error under SC decoding for polar coding over W with block-length N and rate R, averaged over all choices for the frozen bits u_{A^c}. The main coding result of this paper is the following:

Theorem 3: For any given B-DMC W and fixed R < I(W), block error probability for polar coding under successive cancellation decoding satisfies

P_e(N, R) ≤ O(N^{−1/4}).   (15)

This theorem follows as an easy corollary to Theorem 2
and the bound (13), as we show in Sect. V-B. For symmetric channels, we have the following stronger version of Theorem 3.

Theorem 4: For any symmetric B-DMC W and any fixed R < I(W), consider any sequence of G_N-coset codes (N, K, A, u_{A^c}) with N increasing to infinity, K = ⌊NR⌋, A chosen in accordance with the polar coding rule for W, and u_{A^c} fixed arbitrarily. The block error probability under successive cancellation decoding satisfies

P_e(N, K, A, u_{A^c}) ≤ O(N^{−1/4}).   (16)

This is proved in Sect. VI-B. Note that for symmetric channels I(W) equals the Shannon capacity of W.

6) Complexity: An important issue about polar coding is the complexity of encoding, decoding, and code construction. The recursive structure of the channel polarization construction leads to low-complexity encoding and decoding algorithms for the class of G_N-coset codes, and in particular, for polar codes.

Theorem 5: For the class of G_N-coset codes, the complexity of encoding and the complexity of successive cancellation decoding are both O(N log N) as functions of code block-length N.

This theorem is proved in Sections VII and VIII. Notice that the complexity bounds in Theorem 5 are independent of the code rate and the way the frozen vector is chosen. The bounds hold even at rates above I(W), but clearly this has no practical significance.

As for code construction, we have found no low-complexity algorithms for constructing polar codes. One exception is the case of a BEC for which we have a polar code construction algorithm with complexity O(N). We discuss the code construction problem further in Sect. IX and suggest a low-complexity statistical algorithm for approximating the exact polar code construction.

D. Relations to previous work

This paper is an extension of work begun in [2], where channel combining and splitting were used to show that improvements can be obtained in the sum cutoff rate for some specific DMCs. However, no recursive method was suggested there to reach the ultimate limit of such improvements.

As the present work progressed,
it became clear that polar coding had much in common with Reed-Muller (RM) coding [3], [4]. Indeed, recursive code construction and SC decoding, which are two essential ingredients of polar coding, appear to have been introduced into coding theory by RM codes.

According to one construction of RM codes, for any N = 2^n, n ≥ 0, and 0 ≤ K ≤ N, an RM code with block-length N and dimension K, denoted RM(N, K), is defined as a linear code whose generator matrix G_RM(N, K) is obtained by deleting (N − K) of the rows of F^{⊗n} so that none of the deleted rows has a larger Hamming weight (number of 1s in that row) than any of the remaining K rows. For instance, G_RM(4, 4) = F^{⊗2} and G_RM(4, 2) = [1 0 1 0; 1 1 1 1].

This construction brings out the similarities between RM codes and polar codes. Since G_N and F^{⊗n} have the same set of rows (only in a different order) for any N = 2^n, it is clear that RM codes belong to the class of G_N-coset codes. For example, RM(4, 2) is the G_4-coset code with parameter (4, 2, {2, 4}, (0, 0)). So, RM coding and polar coding may be regarded as two alternative rules for selecting the information set A of a G_N-coset code of a given size (N, K). Unlike polar coding, RM coding selects the information set in a channel-independent manner; it is not as fine-tuned to the channel polarization phenomenon as polar coding is. We will show in Sect. X that, at least for the class of BECs, the RM rule for information set selection leads to asymptotically unreliable codes under SC decoding. So, polar coding goes beyond RM coding in a non-trivial manner by paying closer attention to channel polarization.

Another connection to existing work can be established by noting that polar codes are multi-level |u|u+v| codes, which are a class of codes originating from Plotkin's method for code combining [5]. This connection is not surprising in

view of the fact that RM codes are also multi-level |u|u+v| codes [6, pp. 121-125]. However, unlike typical multi-level code constructions where one begins with specific small codes to build larger ones, in polar coding the multi-level code is obtained by expurgating rows of a full-order generator matrix, G_N, with respect to a channel-specific criterion. The special structure of G_N ensures that, no matter how expurgation is done, the resulting code is a multi-level |u|u+v| code. In essence, polar coding enjoys the freedom to pick a multi-level code from an ensemble of such codes so as to suit the channel at hand, while conventional approaches to multi-level coding do not have this degree of flexibility.

Finally, we wish to mention a spectral interpretation of polar codes which is similar to Blahut's treatment of BCH codes [7, Ch. 9]; this type of similarity has already been pointed out by Forney [8, Ch. 11] in connection with RM codes. From the spectral viewpoint, the encoding operation (8) is regarded as a transform of a frequency domain information vector u_1^N to a time domain codeword vector x_1^N. The transform is invertible with G_N^{−1} = G_N. The decoding operation is regarded as a spectral estimation problem in which one is given a time domain observation y_1^N, which is a noisy version of x_1^N, and asked to estimate u_1^N. To aid the estimation task, one is allowed to freeze a certain number of spectral components of u_1^N. This spectral interpretation of polar coding suggests that it may be possible to treat polar codes and BCH codes in a unified framework. The spectral interpretation also opens the door to the use of various signal processing techniques in polar coding; indeed, in Sect. VII, we exploit some fast transform techniques in designing encoders for polar codes.

E. Paper outline

The rest of the paper is organized as follows. Sect. II explores the recursive properties of the channel splitting operation. In Sect. III, we focus on how I(W) and Z(W) get transformed through a single step of channel combining and
splitting. We extend this to an asymptotic analysis in Sect. IV and complete the proofs of Theorem 1 and Theorem 2. This completes the part of the paper on channel polarization; the rest of the paper is mainly about polar coding. Section V develops an upper bound on the block error probability of polar coding under SC decoding and proves Theorem 3. Sect. VI considers polar coding for symmetric B-DMCs and proves Theorem 4. Sect. VII gives an analysis of the encoder mapping G_N, which results in efficient encoder implementations. In Sect. VIII, we give an implementation of SC decoding with complexity O(N log N). In Sect. IX, we discuss the code construction complexity and propose an O(N log N) statistical algorithm for approximate code construction. In Sect. X, we explain why RM codes have a poor asymptotic performance under SC decoding. In Sect. XI, we point out some generalizations of the present work, give some complementary remarks, and state some open problems.

II. RECURSIVE CHANNEL TRANSFORMATIONS

We have defined a blockwise channel combining and splitting operation by (4) and (5) which transformed N independent copies of W into W_N^(1), …, W_N^(N). The goal in this section is to show that this blockwise channel transformation can be broken recursively into single-step channel transformations.

We say that a pair of binary-input channels W' : X → Ỹ and W'' : X → Ỹ × X are obtained by a single-step transformation of two independent copies of a binary-input channel W : X → Y, and write (W, W) ↦ (W', W''), iff there exists a one-to-one mapping f : Y^2 → Ỹ such that

W'(f(y_1, y_2) | u_1) = Σ_{u_2} (1/2) W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2),   (17)
W''(f(y_1, y_2), u_1 | u_2) = (1/2) W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2)   (18)

for all u_1, u_2 ∈ X, y_1, y_2 ∈ Y.

According to this, we can write (W, W) ↦ (W_2^(1), W_2^(2)) for any given B-DMC W because

W_2^(1)(y_1^2 | u_1) = Σ_{u_2} (1/2) W_2(y_1^2 | u_1^2) = Σ_{u_2} (1/2) W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2),   (19)
W_2^(2)(y_1^2, u_1 | u_2) = (1/2) W_2(y_1^2 | u_1^2) = (1/2) W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2),   (20)

which are in the form of (17) and (18) by taking f as the identity mapping.

It turns out we can write, more generally,

(W_N^(i), W_N^(i)) ↦ (W_{2N}^(2i−1), W_{2N}^(2i)).   (21)

This follows as a corollary to the following:

Proposition 3: For any n ≥ 0, N = 2^n, 1 ≤ i ≤ N,

W_{2N}^(2i−1)(y_1^{2N}, u_1^{2i−2} | u_{2i−1})
= Σ_{u_{2i}} (1/2) W_N^(i)(y_1^N, u_{1,o}^{2i−2} ⊕ u_{1,e}^{2i−2} | u_{2i−1} ⊕ u_{2i}) W_N^(i)(y_{N+1}^{2N}, u_{1,e}^{2i−2} | u_{2i}),   (22)

W_{2N}^(2i)(y_1^{2N}, u_1^{2i−1} | u_{2i}) = (1/2) W_N^(i)(y_1^N, u_{1,o}^{2i−2} ⊕ u_{1,e}^{2i−2} | u_{2i−1} ⊕ u_{2i}) W_N^(i)(y_{N+1}^{2N}, u_{1,e}^{2i−2} | u_{2i}).   (23)

This proposition is proved in the Appendix. The transform relationship (21) can now be justified by noting that (22) and (23) are identical in form to (17) and (18), respectively, after the following substitutions: W ← W_N^(i), W' ← W_{2N}^(2i−1), W'' ← W_{2N}^(2i), u_1 ← u_{2i−1}, u_2 ← u_{2i}, y_1 ← (y_1^N, u_{1,o}^{2i−2} ⊕ u_{1,e}^{2i−2}), y_2 ← (y_{N+1}^{2N}, u_{1,e}^{2i−2}), f(y_1, y_2) ← (y_1^{2N}, u_1^{2i−2}).

Fig. 5. The channel transformation process with N = 8 channels. [Figure: butterfly fabric connecting 8 copies of W to W_2^(1), W_2^(2), then W_4^(1), …, W_4^(4), then W_8^(1), …, W_8^(8).]

Thus, we have shown that the blockwise channel transformation from W^N to (W_N^(1), …, W_N^(N)) breaks at a local level into single-step channel transformations of the form (21). The full set of such transformations forms a fabric as shown in Fig. 5 for N = 8. Reading from right to left, the figure starts with four copies of the transformation (W, W) ↦ (W_2^(1), W_2^(2)) and continues in butterfly patterns, each representing a channel transformation of the form (W_{2^i}^(j), W_{2^i}^(j)) ↦ (W_{2^{i+1}}^(2j−1), W_{2^{i+1}}^(2j)). The two channels at the right end-points of the butterflies are always identical and independent. At the rightmost level there are 8 independent copies of W; at the next level to the left, there are 4 independent copies of W_2^(1) and W_2^(2) each; and so on. Each step to the left doubles the number of channel types, but halves the number of independent copies.

III. TRANSFORMATION OF RATE AND RELIABILITY

We now investigate how the rate and reliability parameters, I(W_N^(i)) and Z(W_N^(i)), change through a local (single-step) transformation (21). By understanding the local behavior, we will be able to reach conclusions about the overall transformation from W^N to (W_N^(1), …, W_N^(N)). Proofs of the results in this section are given in the Appendix.

A. Local transformation of rate and reliability

Proposition 4: Suppose (W, W) ↦ (W', W'') for some set of binary-input channels. Then,

I(W') + I(W'') = 2 I(W),   (24)
I(W') ≤ I(W''),   (25)

with equality in (25) iff I(W) equals 0 or 1.

The equality (24) indicates that the single-step channel transform preserves the symmetric capacity. The inequality (25) together with (24) implies that the symmetric capacity remains unchanged under a single-step transform, I(W') = I(W'') = I(W), iff W is either a perfect channel or a completely noisy one. If W is neither perfect nor completely noisy, the single-step transform moves the symmetric capacity away from the center in the sense that I(W') < I(W) < I(W''), thus helping polarization.

Proposition 5:
Suppose (W, W) ↦ (W', W'') for some set of binary-input channels. Then,

Z(W'') = Z(W)^2,   (26)
Z(W') ≤ 2 Z(W) − Z(W)^2,   (27)
Z(W') ≥ Z(W) ≥ Z(W'').   (28)

Equality holds in (27) iff W is a BEC. We have Z(W') = Z(W'') iff Z(W) equals 0 or 1, or equivalently, iff I(W) equals 1 or 0.

This result shows that reliability can only improve under a single-step channel transform in the sense that

Z(W') + Z(W'') ≤ 2 Z(W),   (29)

with equality iff W is a BEC.

Since the BEC plays a special role w.r.t. extremal behavior of reliability, it deserves special attention.

Proposition 6: Consider the channel transformation (W, W) ↦ (W', W''). If W is a BEC with some erasure probability ǫ, then the channels W' and W'' are BECs with erasure probabilities 2ǫ − ǫ^2 and ǫ^2, respectively. Conversely, if W' or W'' is a BEC, then W is a BEC.

B. Rate and reliability for W_N^(i)

We now return to the context at the end of Sect. II.

Proposition 7: For any B-DMC W, N = 2^n, n ≥ 0, 1 ≤ i ≤ N, the transformation (W_N^(i), W_N^(i)) ↦ (W_{2N}^(2i−1), W_{2N}^(2i)) is rate-preserving and reliability-improving in the sense that

I(W_{2N}^(2i−1)) + I(W_{2N}^(2i)) = 2 I(W_N^(i)),   (30)
Z(W_{2N}^(2i−1)) + Z(W_{2N}^(2i)) ≤ 2 Z(W_N^(i)),   (31)

with equality in (31) iff W is a BEC. Channel splitting moves the rate and reliability away from the center in the sense that

I(W_{2N}^(2i−1)) ≤ I(W_N^(i)) ≤ I(W_{2N}^(2i)),   (32)
Z(W_{2N}^(2i−1)) ≥ Z(W_N^(i)) ≥ Z(W_{2N}^(2i)),   (33)

with equality in (32) and (33) iff I(W) equals 0 or 1. The reliability terms further satisfy

Z(W_{2N}^(2i−1)) ≤ 2 Z(W_N^(i)) − Z(W_N^(i))^2,   (34)
Z(W_{2N}^(2i)) = Z(W_N^(i))^2,   (35)

with equality in (34) iff W is a BEC. The cumulative rate and reliability satisfy

Σ_{i=1}^N I(W_N^(i)) = N I(W),   (36)
Σ_{i=1}^N Z(W_N^(i)) ≤ N Z(W),   (37)

with equality in (37) iff W is a BEC.

This result follows from Prop. 4 and Prop. 5 as a special case and no separate proof is needed. The cumulative relations (36) and (37) follow by repeated application of (30) and (31), respectively. The conditions for equality in Prop. 7 are stated in terms of W rather than W_N^(i); this is possible because: (i) by Prop. 4, I(W) ∈ {0, 1} iff I(W_N^(i)) ∈ {0, 1}; and (ii) W is a BEC iff W_N^(i) is a BEC, which follows from Prop. 6 by induction.

For the special case that W is a BEC with an erasure probability ǫ, it follows from Prop. 7 and Prop. 6 that the parameters {Z(W_N^(i))} can be computed through the recursion

Z(W_N^(2j−1)) = 2 Z(W_{N/2}^(j)) − Z(W_{N/2}^(j))^2,
Z(W_N^(2j)) = Z(W_{N/2}^(j))^2,   (38)

with Z(W_1^(1)) = ǫ. The parameter Z(W_N^(i)) equals the erasure probability of the channel W_N^(i). The recursive relations (6) follow from (38) by the fact that I(W_N^(i)) = 1 − Z(W_N^(i)) for a BEC.

IV. CHANNEL POLARIZATION

We prove the main results on channel polarization in this section. The analysis is based on the recursive relationships depicted in Fig. 5; however, it will be more convenient to re-sketch Fig. 5 as a binary tree as shown in Fig. 6. The root node of the tree is associated with the channel W. The root W gives birth to an upper channel W_2^(1) and a lower channel W_2^(2), which are associated with the two nodes at level 1. The channel W_2^(1) in turn gives birth to the channels W_4^(1) and W_4^(2), and so on. The channel W_{2^n}^(i) is located at level n of the tree at node number i counting from the top.

There is a natural indexing of nodes of the tree in Fig. 6 by bit sequences. The root node is indexed with the null sequence. The upper node at level 1 is indexed with 0 and the lower node with 1. Given a node at level n with index b_1 b_2 ⋯ b_n, the upper node emanating from it has the label b_1 b_2 ⋯ b_n 0 and the lower node b_1 b_2 ⋯ b_n 1. According to this labeling, the channel W_{2^n}^(i) is situated at the node b_1 b_2 ⋯ b_n with i = 1 + Σ_{j=1}^n b_j 2^{n−j}. We denote the channel W_{2^n}^(i) located at node b_1 b_2 ⋯ b_n alternatively as W_{b_1 ⋯ b_n}.
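For a BEC, the recursion (38) combines naturally with this node labeling: following an upper branch (bit 0) maps z to 2z − z^2, and a lower branch (bit 1) maps z to z^2. The sketch below (our own helper names, not from the paper) computes Z at every node of a given level and checks that, as (37) holds with equality for the BEC, the level average of Z stays exactly at ǫ:

```python
def bec_z_at_node(eps, bits):
    """Z(W_{b_1...b_n}) for a BEC(eps): bit 0 (upper branch) maps z -> 2z - z^2,
    bit 1 (lower branch) maps z -> z^2, starting from Z(W) = eps."""
    z = eps
    for b in bits:
        z = z * z if b else 2 * z - z * z
    return z

def bec_z_level(eps, n):
    """All 2^n values Z(W_{2^n}^(i)), i = 1..2^n; the binary expansion of i - 1
    (MSB first) is exactly the node label b_1...b_n."""
    return [bec_z_at_node(eps, [(i >> k) & 1 for k in range(n - 1, -1, -1)])
            for i in range(1 << n)]

z = bec_z_level(0.5, 10)  # level n = 10, i.e. N = 1024

# Equality case of (37) for the BEC: the level average of Z equals eps exactly
assert abs(sum(z) / len(z) - 0.5) < 1e-9

# Polarization: most values are already near the extremes at this depth
frac_polarized = sum(1 for v in z if v < 0.1 or v > 0.9) / len(z)
```

By the identity I = 1 − Z for the BEC, `[1 - v for v in z]` reproduces the capacity values of recursion (6).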
We define a random tree process, denoted {K_n; n ≥ 0}, in connection with Fig. 6. The process begins at the root of the tree with K_0 = W. For any n ≥ 0, given that K_n = W_{b_1 ⋯ b_n}, K_{n+1} equals W_{b_1 ⋯ b_n 0} or W_{b_1 ⋯ b_n 1} with probability 1/2 each. Thus, the path taken by {K_n} through the channel tree may be thought of as being driven by a sequence of i.i.d. Bernoulli RVs {B_n; n = 1, 2, …} where B_n equals 0 or 1 with equal probability. Given that B_1, …, B_n has taken on a sample value b_1, …, b_n, the random channel process takes the value K_n = W_{b_1 ⋯ b_n}. In order to keep track of the rate and reliability parameters of the random sequence of channels K_n, we define the random processes I_n = I(K_n) and Z_n = Z(K_n).

Fig. 6. The tree process for the recursive channel construction. [Figure: binary tree with root W and nodes W_0, W_1, W_00, …, W_111.]

For a more precise formulation of the problem, we consider the probability space (Ω, F, P) where Ω is the space of all binary sequences (b_1, b_2, …) ∈ {0, 1}^∞, F is the Borel field (BF) generated by the cylinder sets S(b_1, …, b_n) ≜ {ω ∈ Ω : ω_1 = b_1, …, ω_n = b_n}, n ≥ 1, b_1, …, b_n ∈ {0, 1}, and P is the probability measure defined on F such that P(S(b_1, …, b_n)) = 1/2^n. For each n ≥ 1, we define F_n as the BF generated by the cylinder sets S(b_1, …, b_i), 1 ≤ i ≤ n, b_1, …, b_i ∈ {0, 1}. We define F_0 as the trivial BF consisting of the null set and Ω only. Clearly, F_0 ⊂ F_1 ⊂ ⋯ ⊂ F.

The random processes described above can now be formally defined as follows. For ω = (ω_1, ω_2, …) ∈ Ω and n ≥ 1, define B_n(ω) = ω_n, K_n(ω) = W_{ω_1 ⋯ ω_n}, I_n(ω) = I(K_n(ω)), and Z_n(ω) = Z(K_n(ω)). For n = 0, define K_0 = W, I_0 = I(W), Z_0 = Z(W). It is clear that, for any fixed n ≥ 0, the RVs B_n, K_n, I_n, and Z_n are measurable with respect to the BF F_n.

A. Proof of Theorem 1

We will prove Theorem 1 by considering the stochastic convergence properties of the random sequences {I_n} and {Z_n}.

Proposition 8: The sequence of random variables and Borel fields {I_n, F_n; n ≥ 0} is a martingale, i.e.,

F_n ⊂ F_{n+1} and I_n is F_n-measurable,   (39)
E[|I_n|] < ∞,   (40)
I_n = E[I_{n+1} | F_n].   (41)

Furthermore, the sequence {I_n; n ≥ 0} converges a.e. to a random variable I_∞ such that E[I_∞] = I_0.

Proof: Condition
(39) is true by construction and (0) by the fact that 0 I n To prove (), consider a cylinder

set S(b_1, ..., b_n) ∈ F_n and use Prop. 7 to write

E[I_{n+1} | S(b_1, ..., b_n)] = (1/2) I(W_{b_1···b_n 0}) + (1/2) I(W_{b_1···b_n 1}) = I(W_{b_1···b_n}).

Since I(W_{b_1···b_n}) is the value of I_n on S(b_1, ..., b_n), (41) follows. This completes the proof that {I_n, F_n} is a martingale. Since {I_n, F_n} is a uniformly integrable martingale, by general convergence results about such martingales (see, e.g., [9, Theorem 9.4.6]), the claim about I_∞ follows.

It should not be surprising that the limit RV I_∞ takes values a.e. in {0, 1}, which is the set of fixed points of I(W) under the transformation (W, W) → (W′, W′′), as determined by the condition for equality in (5). For a rigorous proof of this statement, we take an indirect approach and bring the process {Z_n; n ≥ 0} also into the picture.

Proposition 9: The sequence of random variables and Borel fields {Z_n, F_n; n ≥ 0} is a supermartingale, i.e.,

F_n ⊂ F_{n+1} and Z_n is F_n-measurable, (42)
E[|Z_n|] < ∞, (43)
Z_n ≥ E[Z_{n+1} | F_n]. (44)

Furthermore, the sequence {Z_n; n ≥ 0} converges a.e. to a random variable Z_∞ which takes values a.e. in {0, 1}.

Proof: Conditions (42) and (43) are clearly satisfied. To verify (44), consider a cylinder set S(b_1, ..., b_n) ∈ F_n and use Prop. 7 to write

E[Z_{n+1} | S(b_1, ..., b_n)] = (1/2) Z(W_{b_1···b_n 0}) + (1/2) Z(W_{b_1···b_n 1}) ≤ Z(W_{b_1···b_n}).

Since Z(W_{b_1···b_n}) is the value of Z_n on S(b_1, ..., b_n), (44) follows. This completes the proof that {Z_n, F_n} is a supermartingale.

For the second claim, observe that the supermartingale {Z_n, F_n} is uniformly integrable; hence, it converges a.e. and in L^1 to a RV Z_∞ such that E[|Z_n − Z_∞|] → 0 (see, e.g., [9, Theorem 9.4.5]). It follows that E[|Z_{n+1} − Z_n|] → 0. But, by Prop. 7, Z_{n+1} = Z_n^2 with probability 1/2; hence, E[|Z_{n+1} − Z_n|] ≥ (1/2) E[Z_n(1 − Z_n)] ≥ 0. Thus, E[Z_n(1 − Z_n)] → 0, which implies E[Z_∞(1 − Z_∞)] = 0. This, in turn, means that Z_∞ equals 0 or 1 a.e.

Proposition 10: The limit RV I_∞ takes values a.e. in the set {0, 1}: P(I_∞ = 1) = I_0 and P(I_∞ = 0) = 1 − I_0.

Proof: The fact that Z_∞ equals 0 or 1 a.e., combined with Prop. 1, implies that I_∞ = 1 − Z_∞ a.e. Since E[I_∞] = I_0, the rest of the claim follows.

As a corollary to Prop. 10, we can conclude that, as N tends to infinity, the symmetric capacity terms {I(W_N^(i)) : 1 ≤ i ≤ N} cluster around 0 and 1, except for a vanishing fraction. This completes the proof of Theorem 1.

It is interesting that the above discussion gives a new interpretation to I_0 = I(W) as the probability that the random process {Z_n; n ≥ 0} converges to zero. We may use this to strengthen the lower bound in (1). (This stronger form is given as a side result and will not be used in the sequel.)

Proposition 11: For any B-DMC W, we have I(W) + Z(W) ≥ 1, with equality iff W is a BEC.

This result can be interpreted as saying that, among all B-DMCs, the BEC presents the most favorable rate-reliability trade-off: it minimizes Z(W) (maximizes reliability) among all channels with a given symmetric capacity I(W); equivalently, it minimizes the I(W) required to achieve a given level of reliability Z(W).

Proof: Consider two channels W and W′ with Z(W) = Z(W′) = z_0. Suppose that W′ is a BEC. Then, W′ has erasure probability z_0 and I(W′) = 1 − z_0. Consider the random processes {Z_n} and {Z′_n} corresponding to W and W′, respectively. By the condition for equality in (3), the process {Z_n} is stochastically dominated by {Z′_n} in the sense that P(Z_n ≤ z) ≥ P(Z′_n ≤ z) for all n ≥ 1, 0 ≤ z ≤ 1. Thus, the probability of {Z_n} converging to zero is lower-bounded by the probability that {Z′_n} converges to zero, i.e., I(W) ≥ I(W′). This implies I(W) + Z(W) ≥ 1.

B. Proof of Theorem 2

We will now prove Theorem 2, which strengthens the above polarization results by specifying a rate of polarization. Consider the probability space (Ω, F, P). For ω ∈ Ω, i ≥ 0, by Prop. 7, we have Z_{i+1}(ω) = Z_i(ω)^2 if B_{i+1}(ω) = 1 and Z_{i+1}(ω) ≤ 2 Z_i(ω) − Z_i(ω)^2 ≤ 2 Z_i(ω) if B_{i+1}(ω) = 0. For ζ > 0 and m ≥ 0, define

T_m(ζ) ≜ {ω ∈ Ω : Z_i(ω) ≤ ζ for all i ≥ m}.

For ω ∈ T_m(ζ) and i ≥ m, we have

Z_{i+1}(ω) / Z_i(ω) ≤ { 2, if B_{i+1}(ω) = 0; ζ, if B_{i+1}(ω) = 1 }

which implies

Z_n(ω) ≤ ζ 2^{n−m} ∏_{i=m+1}^{n} (ζ/2)^{B_i(ω)}, ω ∈ T_m(ζ), n > m.

For n > m ≥ 0 and 0 < η < 1/2, define

U_{m,n}(η) ≜ {ω ∈ Ω : Σ_{i=m+1}^{n} B_i(ω) > (1/2 − η)(n − m)}.

Then, we have

Z_n(ω) ≤ [2^{1/2+η} ζ^{1/2−η}]^{n−m}, ω ∈ T_m(ζ) ∩ U_{m,n}(η);

from which, by putting ζ_0 ≜ 2^{−4} and η_0 ≜ 1/20, we obtain

Z_n(ω) ≤ 2^{−5(n−m)/4}, ω ∈ T_m(ζ_0) ∩ U_{m,n}(η_0). (45)

Now, we show that (45) occurs with sufficiently high probability. First, we use the following result, which is proved in the Appendix.

Lemma 1: For any fixed ζ > 0, δ > 0, there exists a finite integer m_0(ζ, δ) such that P[T_{m_0}(ζ)] ≥ I_0 − δ/2.

Second, we use Chernoff's bound [10, p. 53] to write

P[U_{m,n}(η)] ≥ 1 − 2^{−(n−m)[1 − H(1/2 − η)]} (46)

where H is the binary entropy function. Define n_0(m, η, δ) as the smallest n such that the RHS of (46) is greater than or equal to 1 − δ/2; it is clear that n_0(m, η, δ) is finite for any m ≥ 0, 0 < η < 1/2, and δ > 0. Now, with m = m_1(δ) ≜ m_0(ζ_0, δ) and n = n_1(δ) ≜ n_0(m_1, η_0, δ), we obtain the desired bound:

P[T_{m_1}(ζ_0) ∩ U_{m_1,n}(η_0)] ≥ I_0 − δ, n ≥ n_1.

Finally, we tie the above analysis to the claim of Theorem 2. Define c ≜ 2^{5 m_1/4} and

V_n ≜ {ω ∈ Ω : Z_n(ω) ≤ c 2^{−5n/4}}, n ≥ 0;

and note that

T_{m_1}(ζ_0) ∩ U_{m_1,n}(η_0) ⊂ V_n, n ≥ n_1.

So, P(V_n) ≥ I_0 − δ for n ≥ n_1. On the other hand,

P(V_n) = Σ_{ω_1^n ∈ X^n} 2^{−n} 1{Z(W_{ω_1···ω_n}) ≤ c 2^{−5n/4}} = |A_N| / N,

where A_N ≜ {i ∈ {1, ..., N} : Z(W_N^(i)) ≤ c N^{−5/4}} with N = 2^n. We conclude that |A_N| ≥ N(I_0 − δ) for n ≥ n_1(δ). This completes the proof of Theorem 2.

Given Theorem 2, it is an easy exercise to show that polar coding can achieve rates approaching I(W), as we will show in the next section. It is clear from the above proof that Theorem 2 gives only an ad-hoc result on the asymptotic rate of channel polarization; this result is sufficient for proving a capacity theorem for polar coding; however, finding the exact asymptotic rate of polarization remains an important goal for future research. (A recent result in this direction is discussed in Sect. XI-A.)

V. PERFORMANCE OF POLAR CODING

We show in this section that polar coding can achieve the symmetric capacity I(W) of any B-DMC W. The main technical task will be to prove Prop. 2. We will carry out the analysis over the class of G_N-coset codes before specializing the discussion to polar codes. Recall that individual G_N-coset codes are identified by a parameter vector (N, K, A, u_{A^c}). In the analysis, we will fix the parameters (N, K, A) while keeping u_{A^c} free to take any value over X^{N−K}. In other words, the analysis will be over the ensemble of 2^{N−K} G_N-coset codes with a fixed (N, K, A). The decoder in the system will be the SC decoder described in Sect. I-C.

A. A probabilistic setting for the analysis

Let (X^N × Y^N, P) be a probability space with the probability assignment

P({(u_1^N, y_1^N)}) ≜ 2^{−N} W_N(y_1^N | u_1^N) (47)

for all (u_1^N, y_1^N) ∈ X^N × Y^N. On this probability space, we define an ensemble of random vectors (U_1^N, X_1^N, Y_1^N, Û_1^N) that represent, respectively, the input to the synthetic channel W_N, the input to the product-form channel W^N, the output of W_N (and also of W^N), and the decisions by the decoder. For each sample point (u_1^N, y_1^N) ∈ X^N × Y^N, the first three vectors take on the values U_1^N(u_1^N, y_1^N) = u_1^N, X_1^N(u_1^N, y_1^N) = u_1^N G_N, and Y_1^N(u_1^N, y_1^N) = y_1^N, while the decoder output takes on the value Û_1^N(u_1^N, y_1^N) whose coordinates are defined recursively as

Û_i(u_1^N, y_1^N) = { u_i, i ∈ A^c; h_i(y_1^N, Û_1^{i−1}(u_1^N, y_1^N)), i ∈ A } (48)

for i = 1, ..., N.

A realization u_1^N ∈ X^N for the input random vector U_1^N corresponds to sending the data vector u_A together with the frozen vector u_{A^c}. As random vectors, the data part U_A and the frozen part U_{A^c} are uniformly distributed over their respective ranges and statistically independent. By treating U_{A^c} as a random vector over X^{N−K}, we obtain a convenient method for analyzing code performance averaged over all codes in the ensemble (N, K, A).

The main event of interest in the following analysis is the block error event under SC decoding, defined as

E ≜ {(u_1^N, y_1^N) ∈ X^N × Y^N : Û_A(u_1^N, y_1^N) ≠ u_A}. (49)

Since the decoder never makes an error on the frozen part of U_1^N, i.e., Û_{A^c} equals U_{A^c} with probability one, that part has been excluded from the definition of the block error event.

The probability of error terms P_e(N, K, A) and P_e(N, K, A, u_{A^c}) that were defined in Sect. I-C3 can be expressed in this probability space as

P_e(N, K, A) = P(E), P_e(N, K, A, u_{A^c}) = P(E | {U_{A^c} = u_{A^c}}), (50)

where {U_{A^c} = u_{A^c}} denotes the event {(ũ_1^N, y_1^N) ∈ X^N × Y^N : ũ_{A^c} = u_{A^c}}.

B. Proof of Proposition 2

We may express the block error event as E = ∪_{i∈A} B_i where

B_i ≜ {(u_1^N, y_1^N) ∈ X^N × Y^N : u_1^{i−1} = Û_1^{i−1}(u_1^N, y_1^N), u_i ≠ Û_i(u_1^N, y_1^N)} (51)

is the event that the first decision error in SC decoding occurs at stage i. We notice that

B_i = {(u_1^N, y_1^N) : u_1^{i−1} = Û_1^{i−1}(u_1^N, y_1^N), u_i ≠ h_i(y_1^N, Û_1^{i−1}(u_1^N, y_1^N))}
    = {(u_1^N, y_1^N) : u_1^{i−1} = Û_1^{i−1}(u_1^N, y_1^N), u_i ≠ h_i(y_1^N, u_1^{i−1})}
    ⊂ {(u_1^N, y_1^N) : u_i ≠ h_i(y_1^N, u_1^{i−1})}
    ⊂ E_i

where

E_i ≜ {(u_1^N, y_1^N) ∈ X^N × Y^N : W_N^(i)(y_1^N, u_1^{i−1} | u_i) ≤ W_N^(i)(y_1^N, u_1^{i−1} | u_i ⊕ 1)}. (52)

Thus, we have

E ⊂ ∪_{i∈A} E_i,   P(E) ≤ Σ_{i∈A} P(E_i).

For an upper bound on P(E_i), note that

P(E_i) = Σ_{u_1^N, y_1^N} 2^{−N} W_N(y_1^N | u_1^N) 1_{E_i}(u_1^N, y_1^N)
       ≤ Σ_{u_1^N, y_1^N} 2^{−N} W_N(y_1^N | u_1^N) sqrt( W_N^(i)(y_1^N, u_1^{i−1} | u_i ⊕ 1) / W_N^(i)(y_1^N, u_1^{i−1} | u_i) )
       = Z(W_N^(i)). (53)

We conclude that P(E) ≤ Σ_{i∈A} Z(W_N^(i)), which is equivalent to (3). This completes the proof of Prop. 2. The main coding theorem of the paper now follows readily.

C. Proof of Theorem 3

By Theorem 2, for any given rate R < I(W), there exists a sequence of information sets A_N with size |A_N| ≥ NR such that

Σ_{i∈A_N} Z(W_N^(i)) ≤ N max_{i∈A_N} {Z(W_N^(i))} = O(N^{−1/4}). (54)

In particular, the bound (54) holds if A_N is chosen in accordance with the polar coding rule because by definition this rule minimizes the sum in (54). Combining this fact about the polar coding rule with Prop. 2, Theorem 3 follows.

D. A numerical example

Although we have established that polar codes achieve the symmetric capacity, the proofs have been of an asymptotic nature and the exact asymptotic rate of polarization has not been found. It is of interest to understand how quickly the polarization effect takes hold and what performance can be expected of polar codes under SC decoding in the non-asymptotic regime. To investigate these, we give here a numerical study.

Let W be a BEC with erasure probability 1/2. Figure 7 shows the rate vs. reliability trade-off for using polar codes with block-lengths N ∈ {2^10, 2^15, 2^20}. This figure is obtained by using codes whose information sets are of the form A(η) ≜ {i ∈ {1, ..., N} : Z(W_N^(i)) < η}, where 0 ≤ η ≤ 1 is a variable threshold parameter. There are two sets of three curves in the plot. The solid lines are plots of R(η) ≜ |A(η)| / N vs. B(η) ≜ Σ_{i∈A(η)} Z(W_N^(i)). The dashed lines are plots of R(η) vs. L(η) ≜ max_{i∈A(η)} {Z(W_N^(i))}. The parameter η is varied over a subset of [0, 1] to obtain the curves.

The parameter R(η) corresponds to the code rate. The significance of B(η) is also clear: it is an upper bound on P_e(η), the probability of block error for polar coding at rate R(η) under SC decoding. The parameter L(η) is intended to serve as a lower bound to P_e(η).

Fig. 7. Rate vs. reliability for polar coding and SC decoding at block-lengths 2^10, 2^15, and 2^20 on a BEC with erasure probability 1/2. (Axes: rate in bits vs. bounds on the probability of block error.)

This example provides empirical evidence that polar coding achieves channel capacity as the block-length is increased, a fact already established theoretically. More significantly, the example also shows that the rate of polarization is too slow to make near-capacity polar coding under SC decoding feasible in practice.

VI. SYMMETRIC CHANNELS

The main goal of this section is to prove Theorem 4, which is a strengthened version of Theorem 3 for symmetric channels.

A. Symmetry under channel combining and splitting

Let W : X → Y be a symmetric B-DMC with X = {0, 1} and Y arbitrary. By definition, there exists a permutation π_1 on Y such that (i) π_1^{−1} = π_1 and (ii) W(y | 1) = W(π_1(y) | 0) for all y ∈ Y. Let π_0 be the identity permutation on Y. Clearly, the permutations (π_0, π_1) form an abelian group under function composition. For a compact notation, we will write x · y to denote π_x(y), for x ∈ X, y ∈ Y.

Observe that W(y | x ⊕ a) = W(a · y | x) for all a, x ∈ X, y ∈ Y. This can be verified by exhaustive study of possible cases or by noting that W(y | x ⊕ a) = W((x ⊕ a) · y | 0) = W(x · (a · y) | 0) = W(a · y | x). Also observe that W(y | x ⊕ a) = W(x · y | a) as ⊕ is a commutative operation on X.

For x_1^N ∈ X^N, y_1^N ∈ Y^N, let

x_1^N · y_1^N ≜ (x_1 · y_1, ..., x_N · y_N). (55)

This associates to each element of X^N a permutation on Y^N.

Proposition 12: If a B-DMC W is symmetric, then W^N is also symmetric in the sense that

W^N(y_1^N | x_1^N ⊕ a_1^N) = W^N(a_1^N · y_1^N | x_1^N) (56)

for all x_1^N, a_1^N ∈ X^N, y_1^N ∈ Y^N.

The proof is immediate and omitted.
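The quantities R(η) and B(η) in the numerical study of Sect. V-D are straightforward to compute for a BEC, because for that channel the single-step transformation acts on the Bhattacharyya parameter exactly: the two descendants of a channel with parameter Z have parameters 2Z − Z^2 and Z^2. The sketch below (the helper names are ours, not the paper's) computes all N = 2^n parameters and one (R(η), B(η)) point:

```python
def bec_bhattacharyya(n, z0=0.5):
    """Bhattacharyya parameters Z(W_N^(i)) of all N = 2**n bit-channels
    of a BEC(z0). For a BEC the one-step recursion is exact:
    the 'minus' descendant has 2Z - Z**2, the 'plus' descendant Z**2."""
    zs = [z0]
    for _ in range(n):
        zs = [w for z in zs for w in (2 * z - z * z, z * z)]
    return zs

def rate_and_bound(zs, eta):
    """R(eta) = |A(eta)|/N and the union bound B(eta) = sum of Z over A(eta),
    where A(eta) = {i : Z(W_N^(i)) < eta}."""
    good = [z for z in zs if z < eta]
    return len(good) / len(zs), sum(good)
```

Because the supermartingale inequality for {Z_n} holds with equality on a BEC, the average of the returned list stays exactly z0; polarization shows up as most entries drifting toward 0 or 1 as n grows, e.g. `rate_and_bound(bec_bhattacharyya(10), 1e-3)` gives one point of the N = 2^10 curve.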

Proposition 13: If a B-DMC W is symmetric, then the channels W_N and W_N^(i) are also symmetric in the sense that

W_N(y_1^N | u_1^N) = W_N(a_1^N G_N · y_1^N | u_1^N ⊕ a_1^N), (57)
W_N^(i)(y_1^N, u_1^{i−1} | u_i) = W_N^(i)(a_1^N G_N · y_1^N, u_1^{i−1} ⊕ a_1^{i−1} | u_i ⊕ a_i) (58)

for all u_1^N, a_1^N ∈ X^N, y_1^N ∈ Y^N, N = 2^n, n ≥ 0, 1 ≤ i ≤ N.

Proof: Let x_1^N = u_1^N G_N and observe that

W_N(y_1^N | u_1^N) = ∏_{i=1}^{N} W(y_i | x_i) = ∏_{i=1}^{N} W(x_i · y_i | 0) = W^N(x_1^N · y_1^N | 0_1^N).

Now, let b_1^N = a_1^N G_N, and use the same reasoning to see that

W_N(b_1^N · y_1^N | u_1^N ⊕ a_1^N) = W^N((x_1^N ⊕ b_1^N) · (b_1^N · y_1^N) | 0_1^N) = W^N(x_1^N · y_1^N | 0_1^N).

This proves the first claim. To prove the second claim, we use the first result:

W_N^(i)(y_1^N, u_1^{i−1} | u_i) = Σ_{u_{i+1}^N} (1/2^{N−1}) W_N(y_1^N | u_1^N)
  = Σ_{u_{i+1}^N} (1/2^{N−1}) W_N(a_1^N G_N · y_1^N | u_1^N ⊕ a_1^N)
  = W_N^(i)(a_1^N G_N · y_1^N, u_1^{i−1} ⊕ a_1^{i−1} | u_i ⊕ a_i)

where we used the fact that the sum over u_{i+1}^N ∈ X^{N−i} can be replaced with a sum over u_{i+1}^N ⊕ a_{i+1}^N for any fixed a_1^N since {u_{i+1}^N ⊕ a_{i+1}^N : u_{i+1}^N ∈ X^{N−i}} = X^{N−i}.

B. Proof of Theorem 4

We return to the analysis in Sect. V and consider a code ensemble (N, K, A) under SC decoding, only this time assuming that W is a symmetric channel. We first show that the error events {E_i} defined by (52) have a symmetry property.

Proposition 14: For a symmetric B-DMC W, the event E_i has the property that

(u_1^N, y_1^N) ∈ E_i iff (a_1^N ⊕ u_1^N, a_1^N G_N · y_1^N) ∈ E_i (59)

for each 1 ≤ i ≤ N, (u_1^N, y_1^N) ∈ X^N × Y^N, a_1^N ∈ X^N.

Proof: This follows directly from the definition of E_i by using the symmetry property (58) of the channel W_N^(i).

Now, consider the transmission of a particular source vector u_A and a frozen vector u_{A^c}, jointly forming an input vector u_1^N for the channel W_N. This event is denoted below as {U_1^N = u_1^N} instead of the more formal {u_1^N} × Y^N.

Corollary 1: For a symmetric B-DMC W, for each 1 ≤ i ≤ N and u_1^N ∈ X^N, the events E_i and {U_1^N = u_1^N} are independent; hence, P(E_i) = P(E_i | {U_1^N = u_1^N}).

Proof: For (u_1^N, y_1^N) ∈ X^N × Y^N and x_1^N = u_1^N G_N, we have

P(E_i | {U_1^N = u_1^N}) = Σ_{y_1^N} W_N(y_1^N | u_1^N) 1_{E_i}(u_1^N, y_1^N)
  = Σ_{y_1^N} W^N(x_1^N · y_1^N | 0_1^N) 1_{E_i}(0_1^N, x_1^N · y_1^N) (60)
  = P(E_i | {U_1^N = 0_1^N}). (61)

Equality follows in (60) from (57) and (59) by taking a_1^N = u_1^N, and in (61) from the fact that {x_1^N · y_1^N : y_1^N ∈ Y^N} = Y^N for any fixed x_1^N ∈ X^N. The rest of the proof is immediate.

Now, by (53), we have, for all u_1^N ∈ X^N,

P(E_i | {U_1^N = u_1^N}) ≤ Z(W_N^(i)) (62)

and, since E ⊂ ∪_{i∈A} E_i, we obtain

P(E | {U_1^N = u_1^N}) ≤ Σ_{i∈A} Z(W_N^(i)). (63)

This implies that, for every symmetric B-DMC W and every (N, K, A, u_{A^c}) code,

P_e(N, K, A, u_{A^c}) = Σ_{u_A ∈ X^K} (1/2^K) P(E | {U_1^N = u_1^N}) ≤ Σ_{i∈A} Z(W_N^(i)). (64)

This bound on P_e(N, K, A, u_{A^c}) is independent of the frozen vector u_{A^c}. Theorem 4 is now obtained by combining Theorem 2 with Prop. 2, as in the proof of Theorem 3.

Note that although we have given a bound on P(E | {U_1^N = u_1^N}) that is independent of u_1^N, we stopped short of claiming that the error event E is independent of U_1^N because our decision functions {h_i} break ties always in favor of û_i = 0. If this bias were removed by randomization, then E would become independent of U_1^N.

C. Further symmetries of the channel W_N^(i)

We may use the degrees of freedom in the choice of a_1^N in (58) to explore the symmetries inherent in the channel W_N^(i). For a given (y_1^N, u_1^i), we may select a_1^N with a_1^i = u_1^i to obtain

W_N^(i)(y_1^N, u_1^{i−1} | u_i) = W_N^(i)(a_1^N G_N · y_1^N, 0_1^{i−1} | 0). (65)

So, if we were to prepare a look-up table for the transition probabilities {W_N^(i)(y_1^N, u_1^{i−1} | u_i) : y_1^N ∈ Y^N, u_1^i ∈ X^i}, it would suffice to store only the subset of probabilities {W_N^(i)(y_1^N, 0_1^{i−1} | 0) : y_1^N ∈ Y^N}.

The size of the look-up table can be reduced further by using the remaining degrees of freedom in the choice of a_{i+1}^N. Let X_N^i ≜ {a_1^N ∈ X^N : a_1^i = 0_1^i}, 1 ≤ i ≤ N. Then, for any 1 ≤ i ≤ N, a_1^N ∈ X_N^i, and y_1^N ∈ Y^N, we have

W_N^(i)(y_1^N, 0_1^{i−1} | 0) = W_N^(i)(a_1^N G_N · y_1^N, 0_1^{i−1} | 0) (66)

which follows from (65) by taking u_1^i = 0_1^i on the left hand side.

To explore this symmetry further, let X_N^i · y_1^N ≜ {a_1^N G_N · y_1^N : a_1^N ∈ X_N^i}. The set X_N^i · y_1^N is the orbit of y_1^N under the action of the group X_N^i. The orbits X_N^i · y_1^N over variation of y_1^N partition the space Y^N into equivalence classes. Let Y_N^i be a set formed by taking one representative from each equivalence class. The output alphabet of the channel W_N^(i) can be represented effectively by the set Y_N^i.

For example, suppose W is a BSC with Y = {0, 1}. Each orbit X_N^i · y_1^N has 2^{N−i} elements and there are 2^i orbits. In particular, the channel W_N^(1) has effectively two outputs, and being symmetric, it has to be a BSC. This is a great

simplification since W_N^(1) has an apparent output alphabet size of 2^N. Likewise, while W_N^(i) has an apparent output alphabet size of 2^{N+i−1}, due to symmetry, the size shrinks to 2^{2i−1}. Further output alphabet size reductions may be possible by exploiting other properties specific to certain B-DMCs. For example, if W is a BEC, the channels {W_N^(i)} are known to be BECs, each with an effective output alphabet size of three.

The symmetry properties of {W_N^(i)} help simplify the computation of the channel parameters.

Proposition 15: For any symmetric B-DMC W, the parameters {Z(W_N^(i))} given by (7) can be calculated by the simplified formula

Z(W_N^(i)) = 2^{i−1} Σ_{y ∈ Y_N^i} |X_N^i · y| sqrt( W_N^(i)(y, 0_1^{i−1} | 0) W_N^(i)(y, 0_1^{i−1} | 1) ).

We omit the proof of this result. For the important example of a BSC, this formula becomes

Z(W_N^(i)) = 2^{N−1} Σ_{y ∈ Y_N^i} sqrt( W_N^(i)(y, 0_1^{i−1} | 0) W_N^(i)(y, 0_1^{i−1} | 1) ).

This sum for Z(W_N^(i)) has 2^i terms, as compared to 2^{N+i−1} terms in (7).

Fig. 8. An alternative realization of the recursive construction for W_N.

VII. ENCODING

In this section, we will consider the encoding of polar codes and prove the part of Theorem 5 about encoding complexity. We begin by giving explicit algebraic expressions for G_N, the generator matrix for polar coding, which so far has been defined only in a schematic form by Fig. 3. The algebraic forms of G_N naturally point at efficient implementations of the encoding operation x_1^N = u_1^N G_N. In analyzing the encoding operation, we exploit its relation to fast transform methods in signal processing; in particular, we use the bit-indexing idea of [] to interpret the various permutation operations that are part of G_N.

A. Formulas for G_N

In the following, assume N = 2^n for some n ≥ 0. Let I_k denote the k-dimensional identity matrix for any k ≥ 1. We begin by translating the recursive definition of G_N as given by Fig. 3 into an algebraic form:

G_N = (I_{N/2} ⊗ F) R_N (I_2 ⊗ G_{N/2}), for N ≥ 2, with G_1 = I_1.

Either by verifying algebraically that (I_{N/2} ⊗ F) R_N = R_N (F ⊗ I_{N/2}) or by observing that the channel combining operation in Fig. 3 can be redrawn equivalently as in Fig. 8, we obtain a second recursive formula

G_N = R_N (F ⊗ I_{N/2})(I_2 ⊗ G_{N/2}) = R_N (F ⊗ G_{N/2}), (67)

valid for N ≥ 2. This form appears more suitable to derive a recursive relationship. We substitute G_{N/2} = R_{N/2} (F ⊗ G_{N/4}) back into (67) to obtain

G_N = R_N ( F ⊗ ( R_{N/2} ( F ⊗ G_{N/4} ) ) ) = R_N ( I_2 ⊗ R_{N/2} ) ( F^{⊗2} ⊗ G_{N/4} ) (68)

where (68) is obtained by using the identity (AC) ⊗ (BD) = (A ⊗ B)(C ⊗ D) with A = I_2, B = R_{N/2}, C = F, D = F ⊗ G_{N/4}. Repeating this, we obtain

G_N = B_N F^{⊗n} (69)

where B_N ≜ R_N (I_2 ⊗ R_{N/2})(I_4 ⊗ R_{N/4}) ··· (I_{N/2} ⊗ R_2). It can be seen by simple manipulations that

B_N = R_N (I_2 ⊗ B_{N/2}). (70)

We can see that B_N is a permutation matrix by the following induction argument. Assume that B_{N/2} is a permutation matrix for some N ≥ 4; this is true for N = 4 since B_2 = I_2. Then, B_N is a permutation matrix because it is the product of two permutation matrices, R_N and I_2 ⊗ B_{N/2}. In the following, we will say more about the nature of B_N as a permutation.

B. Analysis by bit-indexing

To analyze the encoding operation further, it will be convenient to index vectors and matrices with bit sequences. Given a vector a_1^N with length N = 2^n for some n ≥ 0, we denote
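The factorization G_N = B_N F^{⊗n} of (69) suggests an O(N log N) encoder: permute u_1^N by B_N, then apply F^{⊗n} by a halving recursion. The sketch below is ours, not the paper's, and takes for granted that B_N is the bit-reversal permutation (the conclusion of the bit-indexing analysis begun above):

```python
def bit_reversal_permute(u):
    """Apply B_N: the entry whose n-bit index reads b1...bn moves to
    the position whose index reads bn...b1."""
    N = len(u)
    n = N.bit_length() - 1
    return [u[int(format(i, f"0{n}b")[::-1], 2)] for i in range(N)]

def kron_transform(u):
    """Compute the row-vector product u F^{kron n} over GF(2), where
    F = [[1,0],[1,1]], via u (F kron G) = ((u1 xor u2) G, u2 G)."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    a, b = u[:half], u[half:]
    return kron_transform([x ^ y for x, y in zip(a, b)]) + kron_transform(b)

def polar_encode(u):
    """x = u G_N = u B_N F^{kron n}, cf. (69)."""
    return kron_transform(bit_reversal_permute(u))
```

Since F squares to the identity over GF(2), F^{⊗n} is an involution, so applying `kron_transform` twice recovers the input, a convenient self-check of the butterfly.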


More information

Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel

Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 2, FEBRUARY 2002 359 Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel Lizhong Zheng, Student

More information

No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics

No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Geometric Transformations

Geometric Transformations Geometric Transformations Definitions Def: f is a mapping (function) of a set A into a set B if for every element a of A there exists a unique element b of B that is paired with a; this pairing is denoted

More information

Chapter 17. Orthogonal Matrices and Symmetries of Space

Chapter 17. Orthogonal Matrices and Symmetries of Space Chapter 17. Orthogonal Matrices and Symmetries of Space Take a random matrix, say 1 3 A = 4 5 6, 7 8 9 and compare the lengths of e 1 and Ae 1. The vector e 1 has length 1, while Ae 1 = (1, 4, 7) has length

More information

You know from calculus that functions play a fundamental role in mathematics.

You know from calculus that functions play a fundamental role in mathematics. CHPTER 12 Functions You know from calculus that functions play a fundamental role in mathematics. You likely view a function as a kind of formula that describes a relationship between two (or more) quantities.

More information

Coding Theorems for Turbo-Like Codes Abstract. 1. Introduction.

Coding Theorems for Turbo-Like Codes Abstract. 1. Introduction. Coding Theorems for Turbo-Like Codes Dariush Divsalar, Hui Jin, and Robert J. McEliece Jet Propulsion Laboratory and California Institute of Technology Pasadena, California USA E-mail: dariush@shannon.jpl.nasa.gov,

More information

GRAPH THEORY LECTURE 4: TREES

GRAPH THEORY LECTURE 4: TREES GRAPH THEORY LECTURE 4: TREES Abstract. 3.1 presents some standard characterizations and properties of trees. 3.2 presents several different types of trees. 3.7 develops a counting method based on a bijection

More information

Basic Probability Concepts

Basic Probability Concepts page 1 Chapter 1 Basic Probability Concepts 1.1 Sample and Event Spaces 1.1.1 Sample Space A probabilistic (or statistical) experiment has the following characteristics: (a) the set of all possible outcomes

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like

More information

3. Mathematical Induction

3. Mathematical Induction 3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)

More information

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH ZACHARY ABEL 1. Introduction In this survey we discuss properties of the Higman-Sims graph, which has 100 vertices, 1100 edges, and is 22 regular. In fact

More information

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom. Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

Mathematical finance and linear programming (optimization)

Mathematical finance and linear programming (optimization) Mathematical finance and linear programming (optimization) Geir Dahl September 15, 2009 1 Introduction The purpose of this short note is to explain how linear programming (LP) (=linear optimization) may

More information

Outline 2.1 Graph Isomorphism 2.2 Automorphisms and Symmetry 2.3 Subgraphs, part 1

Outline 2.1 Graph Isomorphism 2.2 Automorphisms and Symmetry 2.3 Subgraphs, part 1 GRAPH THEORY LECTURE STRUCTURE AND REPRESENTATION PART A Abstract. Chapter focuses on the question of when two graphs are to be regarded as the same, on symmetries, and on subgraphs.. discusses the concept

More information

Lecture 22: November 10

Lecture 22: November 10 CS271 Randomness & Computation Fall 2011 Lecture 22: November 10 Lecturer: Alistair Sinclair Based on scribe notes by Rafael Frongillo Disclaimer: These notes have not been subjected to the usual scrutiny

More information

CHAPTER 5. Number Theory. 1. Integers and Division. Discussion

CHAPTER 5. Number Theory. 1. Integers and Division. Discussion CHAPTER 5 Number Theory 1. Integers and Division 1.1. Divisibility. Definition 1.1.1. Given two integers a and b we say a divides b if there is an integer c such that b = ac. If a divides b, we write a

More information

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

More information

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

More information

LEARNING OBJECTIVES FOR THIS CHAPTER

LEARNING OBJECTIVES FOR THIS CHAPTER CHAPTER 2 American mathematician Paul Halmos (1916 2006), who in 1942 published the first modern linear algebra book. The title of Halmos s book was the same as the title of this chapter. Finite-Dimensional

More information

How To Find An Optimal Search Protocol For An Oblivious Cell

How To Find An Optimal Search Protocol For An Oblivious Cell The Conference Call Search Problem in Wireless Networks Leah Epstein 1, and Asaf Levin 2 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. lea@math.haifa.ac.il 2 Department of Statistics,

More information

Secure Network Coding on a Wiretap Network

Secure Network Coding on a Wiretap Network IEEE TRANSACTIONS ON INFORMATION THEORY 1 Secure Network Coding on a Wiretap Network Ning Cai, Senior Member, IEEE, and Raymond W. Yeung, Fellow, IEEE Abstract In the paradigm of network coding, the nodes

More information

Solution to Homework 2

Solution to Homework 2 Solution to Homework 2 Olena Bormashenko September 23, 2011 Section 1.4: 1(a)(b)(i)(k), 4, 5, 14; Section 1.5: 1(a)(b)(c)(d)(e)(n), 2(a)(c), 13, 16, 17, 18, 27 Section 1.4 1. Compute the following, if

More information

An example of a computable

An example of a computable An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

More information

CONTINUED FRACTIONS AND FACTORING. Niels Lauritzen

CONTINUED FRACTIONS AND FACTORING. Niels Lauritzen CONTINUED FRACTIONS AND FACTORING Niels Lauritzen ii NIELS LAURITZEN DEPARTMENT OF MATHEMATICAL SCIENCES UNIVERSITY OF AARHUS, DENMARK EMAIL: niels@imf.au.dk URL: http://home.imf.au.dk/niels/ Contents

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Optimal Index Codes for a Class of Multicast Networks with Receiver Side Information

Optimal Index Codes for a Class of Multicast Networks with Receiver Side Information Optimal Index Codes for a Class of Multicast Networks with Receiver Side Information Lawrence Ong School of Electrical Engineering and Computer Science, The University of Newcastle, Australia Email: lawrence.ong@cantab.net

More information

C H A P T E R Regular Expressions regular expression

C H A P T E R Regular Expressions regular expression 7 CHAPTER Regular Expressions Most programmers and other power-users of computer systems have used tools that match text patterns. You may have used a Web search engine with a pattern like travel cancun

More information

= 2 + 1 2 2 = 3 4, Now assume that P (k) is true for some fixed k 2. This means that

= 2 + 1 2 2 = 3 4, Now assume that P (k) is true for some fixed k 2. This means that Instructions. Answer each of the questions on your own paper, and be sure to show your work so that partial credit can be adequately assessed. Credit will not be given for answers (even correct ones) without

More information

Integer Programming Formulation

Integer Programming Formulation Integer Programming Formulation 1 Integer Programming Introduction When we introduced linear programs in Chapter 1, we mentioned divisibility as one of the LP assumptions. Divisibility allowed us to consider

More information

What is Linear Programming?

What is Linear Programming? Chapter 1 What is Linear Programming? An optimization problem usually has three essential ingredients: a variable vector x consisting of a set of unknowns to be determined, an objective function of x to

More information

CHAPTER 3. Methods of Proofs. 1. Logical Arguments and Formal Proofs

CHAPTER 3. Methods of Proofs. 1. Logical Arguments and Formal Proofs CHAPTER 3 Methods of Proofs 1. Logical Arguments and Formal Proofs 1.1. Basic Terminology. An axiom is a statement that is given to be true. A rule of inference is a logical rule that is used to deduce

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

More information

The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

More information

Integer Factorization using the Quadratic Sieve

Integer Factorization using the Quadratic Sieve Integer Factorization using the Quadratic Sieve Chad Seibert* Division of Science and Mathematics University of Minnesota, Morris Morris, MN 56567 seib0060@morris.umn.edu March 16, 2011 Abstract We give

More information

An Adaptive Decoding Algorithm of LDPC Codes over the Binary Erasure Channel. Gou HOSOYA, Hideki YAGI, Toshiyasu MATSUSHIMA, and Shigeichi HIRASAWA

An Adaptive Decoding Algorithm of LDPC Codes over the Binary Erasure Channel. Gou HOSOYA, Hideki YAGI, Toshiyasu MATSUSHIMA, and Shigeichi HIRASAWA 2007 Hawaii and SITA Joint Conference on Information Theory, HISC2007 Hawaii, USA, May 29 31, 2007 An Adaptive Decoding Algorithm of LDPC Codes over the Binary Erasure Channel Gou HOSOYA, Hideki YAGI,

More information

LDPC Codes: An Introduction

LDPC Codes: An Introduction LDPC Codes: An Introduction Amin Shokrollahi Digital Fountain, Inc. 39141 Civic Center Drive, Fremont, CA 94538 amin@digitalfountain.com April 2, 2003 Abstract LDPC codes are one of the hottest topics

More information

T ( a i x i ) = a i T (x i ).

T ( a i x i ) = a i T (x i ). Chapter 2 Defn 1. (p. 65) Let V and W be vector spaces (over F ). We call a function T : V W a linear transformation form V to W if, for all x, y V and c F, we have (a) T (x + y) = T (x) + T (y) and (b)

More information

arxiv:1112.0829v1 [math.pr] 5 Dec 2011

arxiv:1112.0829v1 [math.pr] 5 Dec 2011 How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10 CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10 Introduction to Discrete Probability Probability theory has its origins in gambling analyzing card games, dice,

More information

8 Square matrices continued: Determinants

8 Square matrices continued: Determinants 8 Square matrices continued: Determinants 8. Introduction Determinants give us important information about square matrices, and, as we ll soon see, are essential for the computation of eigenvalues. You

More information

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89 by Joseph Collison Copyright 2000 by Joseph Collison All rights reserved Reproduction or translation of any part of this work beyond that permitted by Sections

More information

Tiers, Preference Similarity, and the Limits on Stable Partners

Tiers, Preference Similarity, and the Limits on Stable Partners Tiers, Preference Similarity, and the Limits on Stable Partners KANDORI, Michihiro, KOJIMA, Fuhito, and YASUDA, Yosuke February 7, 2010 Preliminary and incomplete. Do not circulate. Abstract We consider

More information

Adaptive Linear Programming Decoding

Adaptive Linear Programming Decoding Adaptive Linear Programming Decoding Mohammad H. Taghavi and Paul H. Siegel ECE Department, University of California, San Diego Email: (mtaghavi, psiegel)@ucsd.edu ISIT 2006, Seattle, USA, July 9 14, 2006

More information

Matrix Algebra. Some Basic Matrix Laws. Before reading the text or the following notes glance at the following list of basic matrix algebra laws.

Matrix Algebra. Some Basic Matrix Laws. Before reading the text or the following notes glance at the following list of basic matrix algebra laws. Matrix Algebra A. Doerr Before reading the text or the following notes glance at the following list of basic matrix algebra laws. Some Basic Matrix Laws Assume the orders of the matrices are such that

More information

Matrix Representations of Linear Transformations and Changes of Coordinates

Matrix Representations of Linear Transformations and Changes of Coordinates Matrix Representations of Linear Transformations and Changes of Coordinates 01 Subspaces and Bases 011 Definitions A subspace V of R n is a subset of R n that contains the zero element and is closed under

More information

Reading 13 : Finite State Automata and Regular Expressions

Reading 13 : Finite State Automata and Regular Expressions CS/Math 24: Introduction to Discrete Mathematics Fall 25 Reading 3 : Finite State Automata and Regular Expressions Instructors: Beck Hasti, Gautam Prakriya In this reading we study a mathematical model

More information

Mathematics Review for MS Finance Students

Mathematics Review for MS Finance Students Mathematics Review for MS Finance Students Anthony M. Marino Department of Finance and Business Economics Marshall School of Business Lecture 1: Introductory Material Sets The Real Number System Functions,

More information

1 Sets and Set Notation.

1 Sets and Set Notation. LINEAR ALGEBRA MATH 27.6 SPRING 23 (COHEN) LECTURE NOTES Sets and Set Notation. Definition (Naive Definition of a Set). A set is any collection of objects, called the elements of that set. We will most

More information

1 Error in Euler s Method

1 Error in Euler s Method 1 Error in Euler s Method Experience with Euler s 1 method raises some interesting questions about numerical approximations for the solutions of differential equations. 1. What determines the amount of

More information

ECE 842 Report Implementation of Elliptic Curve Cryptography

ECE 842 Report Implementation of Elliptic Curve Cryptography ECE 842 Report Implementation of Elliptic Curve Cryptography Wei-Yang Lin December 15, 2004 Abstract The aim of this report is to illustrate the issues in implementing a practical elliptic curve cryptographic

More information

Achievable Strategies for General Secure Network Coding

Achievable Strategies for General Secure Network Coding Achievable Strategies for General Secure Network Coding Tao Cui and Tracey Ho Department of Electrical Engineering California Institute of Technology Pasadena, CA 91125, USA Email: {taocui, tho}@caltech.edu

More information