Connexions module: m46228

Universal coding for classes of sources

Denver Greene

This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License <http://creativecommons.org/licenses/by/3.0/>. Version 1.2: May 16, 2013 12:48 pm -0500. <http://cnx.org/content/m46228/1.2/>

We have discussed several parametric sources ("Source models", <http://cnx.org/content/m4623/latest/#uid>), and will now start developing mathematical tools in order to investigate properties of universal codes that offer universal compression w.r.t. a class of parametric sources.

1 Preliminaries

Consider a class of parametric models, where the parameter set Λ characterizes the distribution for a specific source within this class, {p_θ(·), θ ∈ Λ}.

Example 1 Consider the class of memoryless sources over an alphabet α = {1, 2, ..., r}. Here we have

θ = {p(1), p(2), ..., p(r−1)}. (1)

The goal is to find a fixed-to-variable length lossless code that is independent of θ, which is unknown, yet achieves

(1/n) E_θ[l(X^n)] → H_θ(X), (2)

where the expectation is taken w.r.t. the distribution implied by θ. We have seen for

p(x) = (1/2) p_1(x) + (1/2) p_2(x) (3)

that a code that is good for two sources (distributions) p_1 and p_2 exists, modulo the one bit loss here. As an expansion beyond this idea, consider

p(x) = ∫_Λ dw(θ) p_θ(x), (4)

where w(θ) is a prior.

Example 2 Let us revisit the memoryless source, choose r = 2, and define the scalar parameter
θ = Pr(X_i = 1) = 1 − Pr(X_i = 0), (5)

and

p_θ(x) = θ^{n_x(1)} (1−θ)^{n_x(0)}, (6)

where n_x(1) and n_x(0) denote the numbers of 1's and 0's in x, respectively. Moreover, taking a uniform prior w(θ) = 1 over θ ∈ [0, 1],

p(x) = ∫_0^1 dθ θ^{n_x(1)} (1−θ)^{n_x(0)}. (7)

It can be shown that

p(x) = n_x(0)! n_x(1)! / (n+1)!; (8)

this result appears in Krichevsky and Trofimov [2].

Is the source X implied by the distribution p(x) an ergodic source? Consider the event lim_{n→∞} (1/n) Σ_{i=1}^n X_i ≥ 1/2. Owing to symmetry, in the limit of large n the probability of this event under p(x) must be 1/2,

Pr{ lim_{n→∞} (1/n) Σ_{i=1}^n X_i ≥ 1/2 } = 1/2. (9)

On the other hand, recall that an ergodic source must allocate probability 0 or 1 to this flavor of event. Therefore, the source implied by p(x) is not ergodic.

Recall the definitions of p_θ(x) and p(x) in (6) and (7), respectively. Based on these definitions, consider the following:

H_θ(X^n) = −Σ_{x ∈ α^n} p_θ(x) log p_θ(x) = H(X^n | Θ = θ),
H(X^n) = −Σ_{x ∈ α^n} p(x) log p(x),
H(X^n | Θ) = ∫_Λ dw(θ) H(X^n | Θ = θ). (10)

We get the following quantity for the mutual information between the random variable Θ and the random sequence X^n,

I(Θ; X^n) = H(X^n) − H(X^n | Θ). (11)

Note that this quantity represents the gain in bits that knowledge of the parameter creates; more about this quantity will be mentioned later.

2 Redundancy

We now define the conditional redundancy,

r_n(l, θ) = (1/n) [E_θ(l(X^n)) − H_θ(X^n)]; (12)

this quantifies how far a coding length function l is from the entropy when the parameter θ is known. Note that

E[l(X^n)] = ∫_Λ dw(θ) E_θ(l(X^n)) ≥ H(X^n). (13)
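As a quick sanity check of the closed form (8), the sketch below (helper names are illustrative) numerically integrates (7) by the midpoint rule and compares the result to n_x(0)! n_x(1)! / (n+1)!:

```python
from math import factorial

def p_uniform_closed(n0, n1):
    # Closed form (8): p(x) = n_x(0)! n_x(1)! / (n+1)!
    return factorial(n0) * factorial(n1) / factorial(n0 + n1 + 1)

def p_uniform_numeric(n0, n1, steps=100000):
    # Midpoint-rule approximation of the integral (7):
    # p(x) = integral over [0,1] of theta^{n_x(1)} (1-theta)^{n_x(0)} dtheta
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** n1 * (1 - (i + 0.5) * h) ** n0
                   for i in range(steps))

# Example: any particular sequence with three 0's and two 1's
print(p_uniform_closed(3, 2))   # 3! * 2! / 6! = 1/60
```

Note that p(x) depends on x only through the counts n_x(0) and n_x(1), which is why the two arguments suffice.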
Denote by c_n the collection of lossless codes for length-n inputs, and define the expected redundancy of a code l ∈ c_n by

R_n(w, l) = ∫_Λ dw(θ) r_n(l, θ),
R_n(w) = inf_{l ∈ c_n} R_n(w, l). (14)

The asymptotic expected redundancy follows,

R(w) = lim_{n→∞} R_n(w), (15)

assuming that the limit exists. We can also define the minimum redundancy that incorporates the worst prior for the parameter, while keeping the best code,

R_n^− = sup_{w ∈ W} R_n(w). (16)

Similarly,

R^− = lim_{n→∞} R_n^−. (17)

Let us derive R_n^−,

R_n^− = sup_w inf_l ∫_Λ dw(θ) (1/n) [E_θ(l(X^n)) − H(X^n | Θ = θ)]
      = sup_w inf_l (1/n) { E_p[l(X^n)] − H(X^n | Θ) }
      = sup_w (1/n) [H(X^n) − H(X^n | Θ)]
      = sup_w (1/n) I(Θ; X^n) = C_n, (18)

where C_n is the capacity of a channel from the sequence x^n to the parameter θ [4]. That is, we try to estimate the parameter from the noisy channel. In an analogous manner, we define

R_n^+ = inf_l sup_θ r_n(l, θ)
      = inf_l sup_θ (1/n) E_θ[ log( p_θ(x^n) / 2^{−l(x^n)} ) ]
      = inf_Q sup_θ (1/n) D(P_θ ‖ Q), (19)

where Q is the prior induced by the coding length function l.

3 Minimal redundancy

Note that for every w and l,

sup_θ r_n(l, θ) ≥ ∫_Λ w(dθ) r_n(l, θ) ≥ inf_{l ∈ c_n} ∫_Λ w(dθ) r_n(l, θ). (20)

Therefore,

R_n^+ = inf_l sup_θ r_n(l, θ) ≥ sup_w inf_{l ∈ c_n} ∫_Λ w(dθ) r_n(l, θ) = R_n^−. (21)
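To make the identity R_n^− = C_n = sup_w (1/n) I(Θ; X^n) concrete, the sketch below (function names are illustrative) evaluates I(Θ; X^n) for the uniform-prior Bernoulli mixture of Example 2 at small n. H(X^n) is computed exactly by grouping sequences with the same number of 1's, and H(X^n | Θ) = n ∫_0^1 h_2(θ) dθ is approximated numerically:

```python
from math import log2, factorial, comb

def p_mix(n0, n1):
    # Uniform-prior mixture probability of one sequence with n0 zeros, n1 ones
    return factorial(n0) * factorial(n1) / factorial(n0 + n1 + 1)

def mutual_information(n, steps=100000):
    # H(X^n): the comb(n, k) sequences with k ones share the same probability
    H = -sum(comb(n, k) * p_mix(n - k, k) * log2(p_mix(n - k, k))
             for k in range(n + 1))
    # H(X^n | Theta) = n * integral of h2(theta) over [0,1], midpoint rule
    h = 1.0 / steps
    H_cond = n * h * sum(-t * log2(t) - (1 - t) * log2(1 - t)
                         for t in ((i + 0.5) * h for i in range(steps)))
    return H - H_cond

print(mutual_information(10))   # a little over 1 bit at n = 10
```

The value grows slowly with n, consistent with the (1/2) log(n) per-parameter behavior discussed later in the module.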
In fact, Gallager showed that R_n^+ = R_n^−. That is, the min-max and max-min redundancies are equal.

Let us revisit the Bernoulli source p_θ where θ ∈ Λ = [0, 1]. From the definition of (7), which relies on a uniform prior for the sources, i.e., w(θ) = 1, ∀θ ∈ Λ, it can be shown that there exists a universal code with length function l such that

E_θ[l(x^n)] ≤ n E_θ[ h_2( n_x(1)/n ) ] + log(n+1) + 2, (22)

where h_2(p) = −p log(p) − (1−p) log(1−p) is the binary entropy. That is, the redundancy is approximately log(n) bits. Clarke and Barron [1] studied the weighting approach,

p(x) = ∫_Λ dw(θ) p_θ(x), (23)

and constructed a prior that achieves R_n^− = R_n^+ precisely for memoryless sources.

Theorem 5 [1] For a memoryless source with an alphabet of size r, θ = (p(0), p(1), ..., p(r−1)),

R_n(w) = ((r−1)/(2n)) log( n/(2πe) ) + (1/n) ∫_Λ w(dθ) log( √(det I(θ)) / w(θ) ) + O(1/n), (24)

where the O(1/n) term vanishes uniformly as n → ∞ for any compact subset of Λ, and

I(θ) = E_θ[ (∂ ln p_θ(x_i)/∂θ) (∂ ln p_θ(x_i)/∂θ)^T ] (25)

is Fisher's information. Note that when the parameter is sensitive to change we have large I(θ), which increases the redundancy. That is, good sensitivity means bad universal compression. Denote

J(θ) = √(det I(θ)) / ∫_Λ √(det I(θ′)) dθ′; (26)

this is known as Jeffreys' prior. Using w(θ) = J(θ), it can be shown that R_n^− = R_n^+.

Example 3 Let us derive the Fisher information I(θ) for the Bernoulli source. For a single symbol x,

p_θ(x) = θ^{n_x(1)} (1−θ)^{n_x(0)},
ln p_θ(x) = n_x(1) ln(θ) + n_x(0) ln(1−θ),
∂ ln p_θ(x)/∂θ = n_x(1)/θ − n_x(0)/(1−θ),
( ∂ ln p_θ(x)/∂θ )^2 = n_x(1)^2/θ^2 + n_x(0)^2/(1−θ)^2 − 2 n_x(1) n_x(0) / (θ(1−θ)).

Since E[n_x(1)^2] = θ, E[n_x(0)^2] = 1−θ, and E[n_x(1) n_x(0)] = 0,

E[ ( ∂ ln p_θ(x)/∂θ )^2 ] = θ/θ^2 + (1−θ)/(1−θ)^2 − 0 = 1/θ + 1/(1−θ) = 1/(θ(1−θ)). (27)

Therefore, the Fisher information satisfies I(θ) = 1/(θ(1−θ)).

Example 4 Recall the Krichevsky–Trofimov coding, which was mentioned in Example 2. Using the definition of Jeffreys' prior (26), we see that J(θ) ∝ 1/√(θ(1−θ)). Taking the integral over Jeffreys' prior,
p_J(x^n) = ∫_0^1 c dθ/√(θ(1−θ)) · θ^{n_x(1)} (1−θ)^{n_x(0)}
         = c ∫_0^1 θ^{n_x(1)−1/2} (1−θ)^{n_x(0)−1/2} dθ
         = Γ(n_x(0)+1/2) Γ(n_x(1)+1/2) / (π Γ(n+1)), (28)

where c = 1/π normalizes J(θ) and we used the gamma function Γ(·). It can be shown that

p_J(x^n) = ∏_{t=0}^{n−1} p_J(x_{t+1} | x^t), (29)

where

p_J(x_{t+1} | x^t) = p_J(x^{t+1}) / p_J(x^t),
p_J(x_{t+1} = 0 | x^t) = (n_{x^t}(0) + 1/2) / (t+1),
p_J(x_{t+1} = 1 | x^t) = (n_{x^t}(1) + 1/2) / (t+1).

Similar to before, this universal code can be implemented sequentially. It is due to Krichevsky and Trofimov [2], its redundancy satisfies Theorem 5 (p. 4) by Clarke and Barron [1], and it is commonly used in universal lossless compression.

4 Rissanen's bound

Let us consider on an intuitive level why

C_n ≈ (r/(2n)) log(n). (30)

Expending (r/2) log(n) bits allows us to differentiate between (√n)^r parameter vectors. That is, we would differentiate between each of the r parameters with √n levels. Now consider a Bernoulli RV with (unknown) parameter θ. One perspective is that with n drawings of the RV, the standard deviation in the number of 1's is O(√n). That is, √n levels differentiate between parameter levels up to a resolution that reflects the randomness of the experiment.

A second perspective is that of coding a sequence of Bernoulli outcomes with an imprecise parameter, where it is convenient to think of a universal code in terms of first quantizing the parameter and then using that (imprecise) parameter to encode the input x. For the Bernoulli example, the maximum likelihood parameter θ_ML satisfies

θ_ML = argmax_θ { θ^{n_x(1)} (1−θ)^{n_x(0)} }, (31)

and plugging this parameter θ = θ_ML into p_θ(x) minimizes the coding length among all possible parameters θ ∈ Λ. It is readily seen that

θ_ML = n_x(1)/n. (32)

Suppose, however, that we were to encode with θ′ = θ_ML + ε. Then the coding length would be

l(x) = −log( (θ′)^{n_x(1)} (1−θ′)^{n_x(0)} ). (33)

It can be shown that this coding length is suboptimal w.r.t. l_θML(x) by O(nε^2) bits. Keep in mind that doubling the number of parameter levels used by our universal encoder requires an extra bit to encode the extra factor of 2 in resolution.
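The sequential Krichevsky–Trofimov rule in (29) can be sketched as follows (helper names are illustrative); multiplying the conditional probabilities recovers the closed form (28) exactly:

```python
from math import gamma, pi

def kt_sequential(bits):
    # Multiply the KT conditionals: p(next = b | past) = (count(b) + 1/2) / (t + 1)
    p, counts = 1.0, [0, 0]
    for t, b in enumerate(bits):
        p *= (counts[b] + 0.5) / (t + 1)
        counts[b] += 1
    return p

def kt_closed_form(bits):
    # Closed form (28): Gamma(n0 + 1/2) Gamma(n1 + 1/2) / (pi * Gamma(n + 1))
    n0, n1 = bits.count(0), bits.count(1)
    return gamma(n0 + 0.5) * gamma(n1 + 0.5) / (pi * gamma(len(bits) + 1))

x = [0, 1, 1, 0, 1, 0, 0]
print(kt_sequential(x), kt_closed_form(x))   # the two values agree
```

The sequential form is what makes the code practical: each new bit updates one count and one multiplication, with no need to re-evaluate the integral.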
It makes sense to expend this extra bit only if it buys us at least one other bit, meaning that O(nε^2) = 1, which implies that we encode θ_ML to a resolution of ε = 1/√n, corresponding to O(√n) levels. Again, this is a redundancy of (1/2) log(n) bits per parameter.
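The O(nε^2) penalty is easy to observe numerically. In this sketch (the parameter values n = 10000 and θ_ML = 1/2 are illustrative), perturbing the parameter by ε = 1/√n costs only a few bits, doubling ε roughly quadruples the cost, and the measured penalty matches the second-order prediction n I(θ) ε^2 / (2 ln 2) with I(θ) = 1/(θ(1−θ)):

```python
from math import log, log2, sqrt

n = 10_000
n1 = n // 2                      # a sequence with equally many 0's and 1's
theta_ml = n1 / n                # so theta_ML = 1/2

def code_length(theta):
    # -log2 likelihood of a sequence with n1 ones and n - n1 zeros
    return -(n1 * log2(theta) + (n - n1) * log2(1 - theta))

eps = 1 / sqrt(n)
penalty1 = code_length(theta_ml + eps) - code_length(theta_ml)
penalty2 = code_length(theta_ml + 2 * eps) - code_length(theta_ml)

# Second-order prediction using the Fisher information of Example 3
predicted = n * eps ** 2 / (2 * log(2) * theta_ml * (1 - theta_ml))
print(penalty1, penalty2 / penalty1, predicted)
```

With ε = 1/√n the penalty stays O(1) bits, while halving the quantization step (doubling the number of levels) would only shave a fraction of a bit, exactly the trade-off described above.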
Having described Rissanen's result intuitively, let us formalize matters. Consider {p_θ, θ ∈ Λ}, where Λ ⊂ R^K is a compact set. Suppose that there exists an estimator θ̂ such that

∀c: p_θ{ ‖θ̂(x^n) − θ‖ > c/√n } ≤ δ(c), (34)

where lim_{c→∞} δ(c) = 0. Then we have the following converse result.

Theorem 6 (Converse to universal coding [5]) Given a parametric class that satisfies the above condition (34), for all ε > 0 and all codes l that do not know θ,

r_n(l, θ) ≥ (1−ε) (K/(2n)) log(n), (35)

except for a class of θ in B_ε(n) whose Lebesgue volume shrinks to zero as n increases.

That is, a universal code cannot compress at a redundancy substantially below (1/2) log(n) bits per parameter. Rissanen also proved the following achievable result in his seminal paper.

Theorem 7 (Achievability of universal coding [5]) If p_θ(x) is twice differentiable in θ for every x, then there exists a universal code such that ∀θ: r_n(l, θ) ≤ (1+ε) (K/(2n)) log(n).

5 Universal coding for piecewise i.i.d. sources

We have emphasized stationary parametric classes, but a parametric class can be nonstationary. Let us show how universal coding can be achieved for some nonstationary classes of sources by providing an example. Consider Λ = {0, 1, ..., n} where

p_θ(x^n) = Q_1(x_1^θ) Q_2(x_{θ+1}^n), (36)

where Q_1 and Q_2 are both known i.i.d. sources. This is a piecewise i.i.d. source; in each segment it is i.i.d., and there is an abrupt transition in statistics when the first segment ends and the second begins. Here are two approaches to coding this source.

1. Encode the best index θ_ML using log(n+1) bits, then encode with p_θML(x^n). This is known as a two-part code or plug-in; after encoding the index, we plug the best parameter into the distribution. Clearly,

l(x) = min_{0≤θ≤n} [−log p_θ(x^n)] + log(n+1) ≤ −log p_θML(x^n) + log(n+1) + 2, (37)

where the 2 extra bits account for rounding the two code lengths up to integers.

2. The second approach is a mixture; we allocate weights to all possible parameters,

l(x) = −log( (1/(n+1)) Σ_{i=0}^n p_i(x^n) ) < −log( (1/(n+1)) p_θML(x^n) ) = −log( p_θML(x^n) ) + log(n+1). (38)

Merhav [3] provided redundancy theorems for this class of sources. Algorithmic approaches to the mixture appear in Shamir and Merhav [6] and Willems [7].
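The two code lengths (37) and (38) can be compared directly. In the sketch below (Q_1 and Q_2 are illustrative Bernoulli sources and all helper names are hypothetical), the mixture length never exceeds the idealized two-part length, reflecting the inequality in (38):

```python
from math import log2

def seg_log2prob(bits, q):
    # log2 probability of bits under i.i.d. Bernoulli(q), q = Pr(bit = 1)
    return sum(log2(q) if b else log2(1 - q) for b in bits)

def piecewise_log2prob(x, theta, q1, q2):
    # Transition index theta: Q1 covers x[:theta], Q2 covers x[theta:]
    return seg_log2prob(x[:theta], q1) + seg_log2prob(x[theta:], q2)

def two_part_length(x, q1, q2):
    # Best index plus log(n+1) bits to describe it (integer rounding ignored)
    n = len(x)
    best = max(piecewise_log2prob(x, t, q1, q2) for t in range(n + 1))
    return -best + log2(n + 1)

def mixture_length(x, q1, q2):
    # -log2 of the uniform mixture over all n+1 transition locations
    n = len(x)
    mix = sum(2.0 ** piecewise_log2prob(x, t, q1, q2) for t in range(n + 1))
    return -log2(mix / (n + 1))

x = [0] * 25 + [1] * 25          # abrupt transition halfway through
print(two_part_length(x, 0.1, 0.9), mixture_length(x, 0.1, 0.9))
```

Both lengths carry the roughly log(n+1)-bit cost of the transition location, which is the penalty discussed next.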
The theme that is common to both approaches, the plug-in and the mixture, is that they lose approximately log(n) bits in encoding the location of the transition. Indeed, Merhav showed that the penalty for each transition in universal coding is approximately log(n) bits [3]. Intuitively, the reason that the redundancy required to encode the location of the transition is larger than the (1/2) log(n) from Rissanen [5] is that the location of the transition must be described precisely to prevent paying a big coding length penalty in encoding segments using the wrong i.i.d. statistics. In contrast, in encoding our Bernoulli example
an imprecision of 1/√n in encoding θ_ML in the first part of the code yields only an O(1) bit penalty in the second part of the code.

It is well known that mixtures out-compress the plug-in. However, in many cases they do so by only a small amount per parameter. For example, Baron et al. showed that the plug-in for i.i.d. sources loses approximately 1 bit per parameter w.r.t. the mixture.

References

[1] B.S. Clarke and A.R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. J. Stat. Planning Inference, 41(1):37–60, 1994.

[2] R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Trans. Inf. Theory, 27(2):199–207, 1981.

[3] N. Merhav. On the minimum description length principle for sources with piecewise constant parameters. IEEE Trans. Inf. Theory, 39(6):1962–1967, 1993.

[4] N. Merhav and M. Feder. A strong version of the redundancy-capacity theorem of universal coding. IEEE Trans. Inf. Theory, 41(3):714–722, 1995.

[5] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory, 30(4):629–636, Jul. 1984.

[6] G.I. Shamir and N. Merhav. Low-complexity sequential lossless coding for piecewise-stationary memoryless sources. IEEE Trans. Inf. Theory, 45(5):1498–1519, 1999.

[7] F.M.J. Willems. Coding for a binary independent piecewise-identically-distributed source. IEEE Trans. Inf. Theory, 42(6):2210–2217, 1996.