Analysis of symbolic sequences using the Jensen-Shannon divergence


PHYSICAL REVIEW E, VOLUME 65

Ivo Grosse,1,2 Pedro Bernaola-Galván,2,3 Pedro Carpena,2,3 Ramón Román-Roldán,4 Jose Oliver,5 and H. Eugene Stanley2

1 Cold Spring Harbor Laboratory, Cold Spring Harbor, New York
2 Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts
3 Departamento de Física Aplicada II, ETSI de Telecomunicación, Universidad de Málaga, Málaga, Spain
4 Departamento de Física Aplicada, Universidad de Granada, Granada, Spain
5 Departamento de Genética e Instituto de Biotecnología, Universidad de Granada, Granada, Spain

(Received 22 December 2000; revised manuscript received 8 August 2001; published 25 March 2002)

We study statistical properties of the Jensen-Shannon divergence $D$, which quantifies the difference between probability distributions, and which has been widely applied to analyses of symbolic sequences. We present three interpretations of $D$ in the framework of statistical physics, information theory, and mathematical statistics, and obtain approximations of the mean, the variance, and the probability distribution of $D$ in random, uncorrelated sequences. We present a segmentation method based on $D$ that is able to segment a nonstationary symbolic sequence into stationary subsequences, and apply this method to DNA sequences, which are known to be nonstationary on a wide range of different length scales.

DOI: /PhysRevE PACS numbers: Cc

© 2002 The American Physical Society

I. INTRODUCTION

The statistical analysis of symbolic sequences is of central importance in various fields of science, such as symbolic dynamics [1,2], linguistics following the pioneering works of Shannon [3], or DNA sequence analysis [4-7]. One advantage of using information-theoretical functionals for the analysis of symbolic sequences is that they do not require the symbolic sequence to be mapped to a numerical sequence, which is necessary in spectral or correlation analyses [8].

One of these functionals is the Jensen-Shannon divergence $D$ [9-12], which quantifies the difference between two or more probability distributions, and which can be used to compare the symbol composition between different sequences. There are three reasons why we choose $D$ as a measure of divergence between probability distributions: (i) $D$ is related to other information-theoretical functionals, such as the relative entropy or the Kullback divergence, and hence it shares their mathematical properties as well as their intuitive interpretability, (ii) $D$ can be generalized to measure the distance between more than two distributions, and (iii) the compared distributions can be weighted, which allows us to take into account the different lengths of the subsequences from which the probability distributions are computed [13].

$D$ has been used for measuring the distance between random graphs [10], for testing the goodness-of-fit of point estimations [12], in the analysis of DNA sequences [13,14], in the segmentation of textured images [15], and in the design of a statistical characterization of the mobility edge in disordered materials [16]. In addition, by making use of its ability to be generalized to an arbitrary number of probability distributions, $D$ has been used to quantify the complex heterogeneity of DNA sequences as well as to detect borders between coding and noncoding DNA [20]. Here we describe in detail some statistical properties of $D$ as well as some theoretical background relevant for the above-mentioned applications.

This paper is organized as follows: in Sec. II we introduce $D$ and some of its mathematical properties. In Sec. III we provide three interpretations of $D$, one in the framework of statistical physics, one in the framework of information theory, and one in the framework of mathematical statistics. In Sec. IV we discuss some statistical properties of $D$, and we derive the mean, the variance, and the asymptotic probability distribution function of $D$. In Sec. V we apply the Jensen-Shannon divergence to the problem of segmenting a nonstationary sequence into stationary subsequences, and show that in this context the maximum value $D_{\max}$ of the Jensen-Shannon divergence $D$ sampled along a sequence becomes a quantity of central importance. Hence, we study the probability distribution of $D_{\max}$ by means of Monte Carlo simulations. In Sec. VI we present three examples of how $D$ can be applied to the problem of segmenting nonstationary symbolic sequences such as DNA sequences into stationary subsequences, and Sec. VII concludes this paper.

II. THE JENSEN-SHANNON DIVERGENCE

Several measures have been proposed to quantify the difference (sometimes called divergence) between two or more probability distributions [9]. One of those measures is the Jensen-Shannon divergence, which is defined as follows: let $p^{(1)} = (p_1^{(1)}, p_2^{(1)}, \ldots, p_k^{(1)})$ and $p^{(2)} = (p_1^{(2)}, p_2^{(2)}, \ldots, p_k^{(2)})$ denote two probability distributions satisfying the usual constraints $\sum_{i=1}^{k} p_i^{(j)} = 1$ and $0 \le p_i^{(j)} \le 1$ for all $i = 1, 2, \ldots, k$ and $j = 1, 2$; and let $\pi^{(1)}$ and $\pi^{(2)}$ denote the weights of the distributions $p^{(1)}$ and $p^{(2)}$, satisfying the constraints $\pi^{(1)} + \pi^{(2)} = 1$ and $0 \le \pi^{(j)} \le 1$. Then the Jensen-Shannon divergence $D$ between the probability distributions $p^{(1)}$ and $p^{(2)}$ with weights $\pi^{(1)}$ and $\pi^{(2)}$ is defined by [11]

$$D[p^{(1)}, p^{(2)}] = H[\pi^{(1)} p^{(1)} + \pi^{(2)} p^{(2)}] - \left( \pi^{(1)} H[p^{(1)}] + \pi^{(2)} H[p^{(2)}] \right), \tag{1}$$

where

$$H[p] = -\sum_{i=1}^{k} p_i \log_2 p_i \tag{2}$$

denotes the Shannon entropy of the probability distribution $p = (p_1, p_2, \ldots, p_k)$.

The Jensen-Shannon divergence $D$ can be shown to be a special case of the Jensen difference divergence introduced by Burbea and Rao [21]. Also, $D$ can be shown to be a special case of the $\phi$ divergence introduced by Csiszár [12,22]. Hence, the Jensen-Shannon divergence $D$ shares all mathematical properties of both the Jensen difference divergence and the $\phi$ divergence. It is interesting to note that the Jensen-Shannon divergence is the only measure that simultaneously belongs to the family of Jensen difference divergences and the family of $\phi$ divergences [12], i.e., the intersection of the two families contains only a single measure, and that measure is the Jensen-Shannon divergence $D$.

In the following two paragraphs we list some mathematical properties of $D$ that turn out to be important for its application as a divergence measure.

(1) By using the Jensen inequality [23] it is easy to see that

$$D[p^{(1)}, p^{(2)}] \ge 0, \tag{3}$$

with $D[p^{(1)}, p^{(2)}] = 0$ if and only if $p^{(1)} = p^{(2)}$.

(2) $D$ is symmetric in its arguments $p^{(1)}$ and $p^{(2)}$ (exchanged together with their weights), i.e.,

$$D[p^{(1)}, p^{(2)}] = D[p^{(2)}, p^{(1)}]. \tag{4}$$

(3) $D$ is well defined even if $p^{(1)}$ and $p^{(2)}$ are not absolutely continuous, i.e., $D$ is well defined even if $p_i^{(1)}$ vanishes without vanishing $p_i^{(2)}$ or if $p_i^{(2)}$ vanishes without vanishing $p_i^{(1)}$.

$D$ can be generalized to quantify the divergence between an arbitrary number of probability distributions. Let us consider $m$ probability distributions $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$, and let us denote by $\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(m)}$ the corresponding weights. We can define the Jensen-Shannon divergence between the probability distributions $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$ with weights $\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(m)}$ by

$$D[p^{(1)}, p^{(2)}, \ldots, p^{(m)}] = H\!\left[ \sum_{j=1}^{m} \pi^{(j)} p^{(j)} \right] - \sum_{j=1}^{m} \pi^{(j)} H[p^{(j)}]. \tag{5}$$
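The general definition, Eq. (5), translates directly into a few lines of code. The following is a minimal sketch in plain Python, not code from the paper; the function names `shannon_entropy` and `jensen_shannon` are illustrative choices, and entropies are measured in bits with the usual convention that terms with $p_i = 0$ contribute zero.

```python
import math

def shannon_entropy(p):
    """Shannon entropy H[p] in bits; terms with p_i = 0 contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jensen_shannon(dists, weights):
    """Weighted Jensen-Shannon divergence between m distributions, in bits.

    dists   -- list of m probability vectors, each with k components
    weights -- list of m non-negative weights that sum to 1
    """
    if len(dists) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    k = len(dists[0])
    # Weighted mixture distribution: sum_j pi_j * p^(j)
    mixture = [sum(w * p[i] for w, p in zip(weights, dists)) for i in range(k)]
    # Entropy of the mixture minus the weighted mean of the entropies
    return shannon_entropy(mixture) - sum(
        w * shannon_entropy(p) for w, p in zip(weights, dists)
    )

# Two binary distributions weighted by hypothetical lengths 500 and 2000
p1, p2 = (0.45, 0.55), (0.55, 0.45)
print(jensen_shannon([p1, p2], [0.2, 0.8]))
```

Because zero components are skipped inside `shannon_entropy`, the function stays well defined for distributions with disjoint support, which is the third property listed above.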
It is interesting to note that the three mathematical properties mentioned above for the binary case can be generalized to the $m$-ary case as follows:

(1) The Jensen inequality [23] implies that

$$D[p^{(1)}, p^{(2)}, \ldots, p^{(m)}] \ge 0, \tag{6}$$

with $D[p^{(1)}, p^{(2)}, \ldots, p^{(m)}] = 0$ if and only if all probability distributions $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$ are identical, i.e., if and only if $p^{(1)} = p^{(2)} = \cdots = p^{(m)}$.

(2) $D$ is symmetric in its arguments $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$, i.e., $D$ is invariant under any permutation of its arguments (permuted together with their weights).

(3) $D$ is well defined even if the probability distributions $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$ are not absolutely continuous.

III. INTERPRETATIONS OF D

In the following three sections we will present three intuitive interpretations of the Jensen-Shannon divergence $D$.

A. Interpretation of D in the framework of statistical physics

In this section we show that $D$ can be interpreted as the intensive mixture entropy in the following way: let us consider $m$ vessels, each one containing a mixture of $k$ ideal gases, let $f^{(j)} = (f_1^{(j)}, f_2^{(j)}, \ldots, f_k^{(j)})$ denote the vector of molar fractions of the gases in the $j$th vessel for $j = 1, 2, \ldots, m$, and let $n^{(j)}$ denote the total number of molecules in the $j$th vessel. Then we know from the second law of thermodynamics that the sum of the Boltzmann entropies of the separate vessels is smaller than or equal to the Boltzmann entropy of the joint vessel that we obtain after mixing the gases from all $m$ vessels, and we can easily show that the entropy obtained after the ideal gases are mixed exceeds the sum of the entropies obtained before they are mixed by

$$H_{\mathrm{mix}} = N k_B \ln 2 \, H[f] - \sum_{j=1}^{m} n^{(j)} k_B \ln 2 \, H[f^{(j)}], \tag{7}$$

where $k_B$ denotes the Boltzmann constant, $N = \sum_{j=1}^{m} n^{(j)}$ denotes the total number of ideal gas particles in all $m$ vessels, and $f = \sum_{j=1}^{m} (n^{(j)}/N) f^{(j)}$ denotes the vector of molar fractions of the gases in the mixture containing the gas particles of all of the $m$ vessels. $H_{\mathrm{mix}}$ is commonly called mixing entropy, and it is easy to see that

$$H_{\mathrm{mix}} = N k_B \ln 2 \, D, \tag{8}$$

if the weights are chosen to be $\pi^{(j)} = n^{(j)}/N$.
Hence, $D$ can be interpreted as the intensive mixture entropy measured in units of $k_B \ln 2$.

B. Interpretation of D in the framework of information theory

In this section we show that $D$ can be interpreted as the mutual information in the following way: let us consider a sequence $S$ of $N$ symbols chosen from the alphabet $A = \{a_1, a_2, \ldots, a_k\}$, and let us denote by $p_i$ the probability of finding symbol $a_i$ at an arbitrary but fixed position in sequence $S$, for $i = 1, 2, \ldots, k$. Suppose that the sequence $S$ is divided into $m$ subsequences $S^{(1)}, S^{(2)}, \ldots, S^{(m)}$ of given lengths $n^{(1)}, n^{(2)}, \ldots, n^{(m)}$, and let us denote by $p_i^{(j)}$ the probability of finding symbol $a_i$ at an arbitrary but fixed position in sequence $S^{(j)}$, for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, m$.

In order to establish the connection between $D$ and the mutual information defined in the framework of information theory, we define the random vector $(a, s)$, where the random variables $a \in A$ and $s \in \{S^{(1)}, S^{(2)}, \ldots, S^{(m)}\}$ are generated as follows: draw a random position $n$ with a uniform probability distribution along the sequence $S$, denote by $a$ the symbol at position $n$, denote by $s$ the subsequence that contains position $n$, and denote by $p_{ij}$ the joint probability of $a = a_i$ and $s = S^{(j)}$ for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, m$. Then we obtain that the random variable $a$ assumes the values $a_1, a_2, \ldots, a_k$ with probabilities $p_1, p_2, \ldots, p_k$, and the random variable $s$ assumes the values $S^{(1)}, S^{(2)}, \ldots, S^{(m)}$ with probabilities $\pi^{(1)} = n^{(1)}/N$, $\pi^{(2)} = n^{(2)}/N$, ..., $\pi^{(m)} = n^{(m)}/N$, where the marginal probabilities $p_i$ and $\pi^{(j)}$ are defined by $p_i = \sum_{j=1}^{m} p_{ij}$ and $\pi^{(j)} = \sum_{i=1}^{k} p_{ij}$ for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, m$.

Suppose that someone is drawing a symbol $a$ from the entire sequence $S$, not telling us from which subsequence $s$ this symbol was drawn, and suppose it is our task to guess the subsequence $s$ from which symbol $a$ was drawn. One question answered by information theory is: How much information $I$ can we obtain from learning the identity of the symbol $a$ about the identity of that subsequence $s$ from which symbol $a$ was drawn, provided we know the probability distribution $p_{ij}$? $I$ is called the mutual information in $a$ about $s$ and is defined by [3]

$$I = \sum_{i=1}^{k} \sum_{j=1}^{m} p_{ij} \log_2 \frac{p_{ij}}{p_i \, \pi^{(j)}}. \tag{9}$$

Taking into account that $p_i^{(j)}$ denotes the conditional probability of finding symbol $a_i$ at an arbitrary but fixed position in a given fixed sequence $S^{(j)}$, it follows that $p_{ij} = \pi^{(j)} p_i^{(j)}$, and Eq. (9) can be rewritten as

$$I = \sum_{j=1}^{m} \sum_{i=1}^{k} \pi^{(j)} p_i^{(j)} \log_2 \frac{p_i^{(j)}}{p_i}. \tag{10}$$

By rewriting Eq. (10) we obtain

$$I = \sum_{j=1}^{m} \pi^{(j)} \sum_{i=1}^{k} p_i^{(j)} \log_2 p_i^{(j)} - \sum_{i=1}^{k} \left( \sum_{j=1}^{m} \pi^{(j)} p_i^{(j)} \right) \log_2 p_i. \tag{11}$$

As $p_i = \sum_{j=1}^{m} \pi^{(j)} p_i^{(j)}$ defines the probability of finding symbol $a_i$ in the whole sequence, we obtain

$$I = D[p^{(1)}, p^{(2)}, \ldots, p^{(m)}]. \tag{12}$$

Hence, $D$ is identical to the mutual information in $a$ about $s$, which quantifies the amount of information we obtain from learning the identity of the chosen symbol $a$ about the identity of that subsequence $s$ from which symbol $a$ was chosen.
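The identity between the mutual information and the Jensen-Shannon divergence with weights $n^{(j)}/N$ can be checked numerically. The sketch below is illustrative only: it builds the joint distribution $p_{ij}$ as the product of weight and conditional probability for two hypothetical subsequences of lengths 500 and 2000, evaluates the double sum of the mutual information, and compares it with the entropy-based definition of $D$. All function names are assumptions made for this example.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jsd_bits(dists, weights):
    """Jensen-Shannon divergence: H[mixture] - sum_j w_j H[p^(j)]."""
    k = len(dists[0])
    mix = [sum(w * p[i] for w, p in zip(weights, dists)) for i in range(k)]
    return entropy_bits(mix) - sum(w * entropy_bits(p) for w, p in zip(weights, dists))

def mutual_information_bits(dists, weights):
    """Mutual information between symbol a and subsequence label s,
    using the joint probabilities p_ij = w_j * p_i^(j)."""
    k = len(dists[0])
    # Marginal symbol probabilities p_i = sum_j p_ij
    marginal = [sum(w * p[i] for w, p in zip(weights, dists)) for i in range(k)]
    total = 0.0
    for w, p in zip(weights, dists):
        for i in range(k):
            p_ij = w * p[i]
            if p_ij > 0:
                total += p_ij * math.log2(p_ij / (marginal[i] * w))
    return total

lengths = [500, 2000]                      # hypothetical subsequence lengths
N = sum(lengths)
weights = [n / N for n in lengths]         # natural weights n_j / N
dists = [(0.45, 0.55), (0.55, 0.45)]       # composition of each subsequence

I = mutual_information_bits(dists, weights)
D = jsd_bits(dists, weights)
print(I, D)  # the two numbers should agree up to rounding error
```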
As $I$ is symmetric in its arguments $a$ and $s$, we may also consider the following game: suppose someone is drawing a symbol $a$ from sequence $S$, not telling us the identity of the drawn symbol $a$, but telling us the identity of that subsequence $s$ from which symbol $a$ was drawn. Suppose further that it is our task to guess the identity of the drawn symbol $a$. One question answered by information theory is: How much information $I$ can we obtain from learning the identity of the subsequence $s$ about the identity of the drawn symbol $a$, provided we know the probability distribution $p_{ij}$? It can be mathematically proven that the mutual information in $a$ about $s$ is identical to the mutual information in $s$ about $a$, and hence we can state that the Jensen-Shannon divergence $D$ quantifies the amount of information we obtain from learning the identity of the subsequence $s$ about the identity of the drawn symbol $a$.

If $p^{(1)} = p^{(2)} = \cdots = p^{(m)}$, then it is clear that knowing the identity of the symbol $a$ does not tell us anything about the identity of the subsequence $s$ from which $a$ was drawn, as the probability distributions of $a$ are identical in all subsequences $s$. Likewise, it is clear that in this case knowing the subsequence $s$ from which $a$ was drawn does not tell us anything about the identity of $a$. Hence, it is intuitively clear that the mutual information in $a$ about $s$ (or the mutual information in $s$ about $a$) is equal to zero, and hence it is also intuitively clear that in this case the Jensen-Shannon divergence $D$ is equal to zero.

C. Interpretation of D in the framework of mathematical statistics

In this section we show that $D$ can be interpreted as the log-likelihood ratio in the following way: consider the problem of estimating the probabilities $p = (p_1, p_2, \ldots, p_k)$ from a symbolic i.i.d. [24] sequence $S$ of length $N$, in which at each position a symbol $a_i \in A = \{a_1, a_2, \ldots, a_k\}$ is randomly drawn with probability $p_i$.
The maximum likelihood principle suggests choosing that probability vector $p$ which maximizes the likelihood

$$L(S|p) = \prod_{i=1}^{k} p_i^{F_i}, \tag{13}$$

where $F_i$ denotes the number of occurrences of symbol $a_i$ in sequence $S$. As the logarithm is a strictly monotonic function, one may equally search for that $p$ which maximizes $\ln L = \sum_{i=1}^{k} F_i \ln p_i$. It is easy to derive, by using one Lagrange multiplier for the constraint $\sum_{i=1}^{k} p_i = 1$, that $p_i = F_i/N$ maximizes the log-likelihood $\ln L$. Hence, we obtain as maximum log-likelihood

$$\ln L_{\max} = N \sum_{i=1}^{k} f_i \ln f_i = -N \ln 2 \, H[f], \tag{14}$$

where $f_i = F_i/N$ denotes the relative frequency of finding symbol $a_i$ in sequence $S$ of length $N$.

Now consider the slightly more complicated problem of a nonstationary sequence $S$ of length $N$ consisting of $m$ stationary subsequences $S^{(1)}, S^{(2)}, \ldots, S^{(m)}$ with lengths $n^{(1)}, n^{(2)}, \ldots, n^{(m)}$, where the probability $p_i^{(j)}$ of generating symbol $a_i$ in subsequence $S^{(j)}$ may vary from subsequence to subsequence. The likelihood of obtaining the entire sequence $S$ is equal to the product of the likelihoods of obtaining the subsequences $S^{(1)}, S^{(2)}, \ldots, S^{(m)}$. Hence, the maximum likelihood principle suggests choosing for each $j = 1, 2, \ldots, m$ that probability vector $p^{(j)} = (p_1^{(j)}, p_2^{(j)}, \ldots, p_k^{(j)})$ that maximizes the likelihood

$$L(S^{(j)}|p^{(j)}) = \prod_{i=1}^{k} \left( p_i^{(j)} \right)^{F_i^{(j)}}, \tag{15}$$

where $F_i^{(j)}$ is the number of occurrences of symbol $a_i$ in subsequence $S^{(j)}$. It is again easy to derive, by using $m$ Lagrange multipliers for the $m$ constraints $\sum_{i=1}^{k} p_i^{(j)} = 1$, that $p_i^{(j)} = F_i^{(j)}/n^{(j)}$ maximizes the log-likelihood $\ln L^{(j)}$. Hence, we obtain as maximum log-likelihood

$$\ln L_{\max}^{(j)} = n^{(j)} \sum_{i=1}^{k} f_i^{(j)} \ln f_i^{(j)} = -n^{(j)} \ln 2 \, H[f^{(j)}], \tag{16}$$

where $f_i^{(j)} = F_i^{(j)}/n^{(j)}$ denotes the relative frequency of finding symbol $a_i$ in subsequence $S^{(j)}$ of length $n^{(j)}$.

As problem one with just one sequence is a special case of problem two with $m$ sequences, the sum of the maximum log-likelihoods $\sum_{j=1}^{m} \ln L_{\max}^{(j)}$ cannot be smaller than $\ln L_{\max}$, because in the worst case, in which all of the subsequences of problem two have identical composition, problem two just reduces to problem one, giving the same log-likelihood as in problem one. Hence, the quantity

$$\Delta L = \sum_{j=1}^{m} \ln L_{\max}^{(j)} - \ln L_{\max} \tag{17}$$

is non-negative, and $\Delta L$ is commonly called the log-likelihood ratio. It is straightforward to see from Eqs. (14), (16), and (17) that

$$\Delta L = N \ln 2 \, D. \tag{18}$$

Hence, in the framework of mathematical statistics $\Delta L$ can be interpreted as the increase of the log-likelihood when sequence $S$, instead of being modeled as a sequence generated with a single probability vector $p$, is modeled as a concatenation of subsequences $S^{(1)}, S^{(2)}, \ldots, S^{(m)}$ (in that order) generated from the probability vectors $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$. The inequality $\Delta L \ge 0$ states that no partition of the original sequence into subsequences can decrease the likelihood of the second model relative to the first model. In order to choose hypothesis two ($m$ subsequences) in favor of hypothesis one (only one sequence), we require that $\Delta L$ be significantly greater than zero, and it is the goal of this paper to derive an approximation of the probability distribution function of $\Delta L$.
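The relation between the log-likelihood ratio and $N \ln 2 \, D$ can be verified directly on symbol counts. The sketch below is a minimal illustration under the assumption that sequences are given as Python strings; the two example subsequences are made up for the purpose of the check, and the function names are hypothetical.

```python
import math
from collections import Counter

def max_log_likelihood(seq):
    """Maximum log-likelihood sum_i F_i * ln(F_i / n) of an i.i.d. model,
    obtained by plugging in the relative frequencies F_i / n."""
    n = len(seq)
    return sum(c * math.log(c / n) for c in Counter(seq).values())

def entropy_bits(seq):
    """Entropy (bits) of the relative symbol frequencies of seq."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

def log_likelihood_ratio(parts):
    """Sum of the per-subsequence maxima minus the whole-sequence maximum."""
    whole = "".join(parts)
    return sum(max_log_likelihood(s) for s in parts) - max_log_likelihood(whole)

def jsd_from_parts(parts):
    """Jensen-Shannon divergence of the observed frequencies, weights n_j / N."""
    whole = "".join(parts)
    N = len(whole)
    return entropy_bits(whole) - sum(len(s) / N * entropy_bits(s) for s in parts)

parts = ["AACAGATTACAATA", "GGCGCGGCCGTGCGCC"]   # made-up example data
N = sum(len(s) for s in parts)
delta_L = log_likelihood_ratio(parts)
print(delta_L, N * math.log(2) * jsd_from_parts(parts))  # should coincide
```

Note that this only confirms the algebraic identity; deciding whether an observed value is significantly greater than zero requires the distribution of $D$ under the null hypothesis, which is what the paper sets out to approximate.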
Note that in all of the above interpretations of $D$ the weights of the distributions $\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(m)}$ are proportional to the sizes $n^{(1)}, n^{(2)}, \ldots, n^{(m)}$ of the elements considered: the number of particles in each of the vessels or the number of symbols in each of the subsequences. It is interesting that this particular choice of weights arises in a natural way from all of the three interpretations presented above, and as we will see later this choice of weights endows the Jensen-Shannon divergence $D$ with several statistical properties that make $D$ particularly suitable for the analysis of symbolic sequences.

IV. STATISTICAL PROPERTIES OF D

Formally, $D$ is a function of the probability distributions $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$, but in analyses of experimental data those probability distributions are not directly observable. However, when we study experimental symbolic sequences we can estimate those probability distributions $p^{(j)}$ from the frequency distributions $f^{(j)} = (f_1^{(j)}, f_2^{(j)}, \ldots, f_k^{(j)})$, where $f_i^{(j)}$ denotes the relative frequency of symbol $a_i$ in subsequence $S^{(j)}$, for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, m$. In all analyses of experimental data the Jensen-Shannon divergence $D$ must be computed from those observable frequency distributions $f^{(1)}, f^{(2)}, \ldots, f^{(m)}$ rather than from the nonobservable probability distributions $p^{(1)}, p^{(2)}, \ldots, p^{(m)}$.

As a consequence of replacing the probabilities $p_i^{(j)}$ by the corresponding relative frequencies $f_i^{(j)}$ in Eq. (1), the numerical values of $D$ will fluctuate from data set to data set, even if those data sets can be assumed to be generated from the same probability distribution. The fluctuation of $f_i^{(j)}$ from data set to data set may not only result in fluctuations of the numerical values of $D$, but also in a systematic shift (bias) of the numerical values of $D$ computed from the observed data as compared to the numerical value of $D$ computed from the unobservable probability distributions.
In order to illustrate the presence of those fluctuations of $D$ as well as its systematic shift (called bias), we perform the following control experiments. We generate an ensemble of 2000 binary sequences ($k = 2$) of $N = 2500$ symbols each, obtained by joining $m = 2$ subsequences as follows: we generate the left subsequence of length $n = 500$ by concatenating random, uncorrelated symbols drawn from the probability distribution $p^{(1)} = (0.45, 0.55)$, and the right subsequence of length $N - n = 2000$ from symbols drawn from the probability distribution $p^{(2)} = (0.55, 0.45)$. We move a cursor along the entire sequence, and we compute $D$ between the subsequences at both sides of the cursor for all positions $n^{(1)} = 1, 2, \ldots, N-1$ and $n^{(2)} = N-1, N-2, \ldots, 1$.

In order to illustrate the effect of different choices of the weights $\pi^{(j)}$, we compute the Jensen-Shannon divergence in two different ways: (i) for the choice of equal weights $\pi^{(j)} = 1/m$ for all subsequences $S^{(j)}$, and (ii) for the natural choice of weights $\pi^{(j)} = n^{(j)}/N$. In the following we denote by $D_{1/m}$ the Jensen-Shannon divergence with the choice of equal weights (i), and we denote by $D$ the Jensen-Shannon divergence with the natural choice of weights (ii). An ideal estimator of $D$, which quantifies the difference between two probability distributions, should reach its maximum value exactly at that point which separates the subsequences generated by different probability distributions, i.e., it should reach its maximum value when $n^{(1)} = n = 500$ and $n^{(2)} = N - n = 2000$. Figure 1(a) shows $\langle D \rangle$ versus $n^{(1)}$ and $\langle D_{1/2} \rangle$ versus $n^{(1)}$, where the symbol $\langle \cdot \rangle$ denotes the ensemble average over all 2000 realizations.

FIG. 1. Comparison of $D$ and $D_{1/2}$. We generate an ensemble of 2000 binary sequences of length $N = 2500$, obtained by joining two subsequences of lengths $n$ and $N - n$, where the left subsequence of length $n$ is generated from a probability distribution $(x, 1-x)$ and the right subsequence of length $N - n$ is generated from a probability distribution $(y, 1-y)$. We move a cursor along the entire sequence and we compute $D$ and $D_{1/2}$ between the subsequences at both sides of the cursor. Finally we plot the ensemble averages $\langle D \rangle$ (solid line) and $\langle D_{1/2} \rangle$ (dashed line) as a function of the position of the cursor $n^{(1)} = 1, 2, \ldots, N-1$. In (a) we choose $n = 500$, $x = 0.45$, and $y = 0.55$, and find that $\langle D \rangle$ achieves its global maximum at $n^{(1)} \approx 500$, in the vicinity of the true fusion point of the two subsequences at $n = 500$, whereas $\langle D_{1/2} \rangle$ achieves its global maximum at the edges $n^{(1)} \to 0$ or $n^{(1)} \to 2500$, far away from the true fusion point. This finding indicates that $D$ might serve as an appropriate divergence measure to quantify the compositional differences between symbolic subsequences, whereas $D_{1/2}$ might not. In (b) we choose $n = 1250$, $x = 0.45$, and $y = 0.55$, and find again that $\langle D \rangle$ achieves its global maximum at $n^{(1)} \approx 1250$, in the vicinity of the true fusion point at $n = 1250$, whereas $\langle D_{1/2} \rangle$ achieves its global maximum at the edges $n^{(1)} \to 0$ or $n^{(1)} \to 2500$, confirming the finding from (a). In (c) we choose $n = 1250$ and $x = y = 0.5$, and we find that $\langle D \rangle$ stays nearly constant at a small positive value, reflecting the fact that the analyzed sequences are stationary, whereas $\langle D_{1/2} \rangle$ clearly increases as $n^{(1)} \to 0$ or $n^{(1)} \to 2500$, confirming the finding from (a) and (b). The effect that even in the case of i.i.d. sequences the expected value of $D$ is greater than zero is referred to as finite-size effect, and we address this finite-size effect in Sec. IV.

Figure 1(a) shows that there are dramatic finite-size effects when using $D_{1/2}$ (dashed line) instead of $D$ (solid line). While $\langle D \rangle$ clearly achieves its global maximum at position $n^{(1)} = n = 500$ (marked with a vertical dotted line in Fig. 1(a)), $\langle D_{1/2} \rangle$ achieves its highest values at the beginning and the end of the horizontal axis, i.e., at very small and very large values of $n^{(1)}$.

We perform a second control experiment similar to the first experiment, in which we change the lengths of the two subsequences to $n = 1250$ and $N - n = 1250$, and in which we keep all other parameters the same as before. Figure 1(b) shows clearly that, again, $\langle D \rangle$ achieves its maximum at $n^{(1)} = n = 1250$, while $\langle D_{1/2} \rangle$ achieves its highest values at the beginning and the end of the horizontal axis, i.e., at very small and very large values of $n^{(1)}$.

These control experiments demonstrate two results: (i) the location of the maximum of $D$ can separate regions of different composition and size in a symbolic sequence, and (ii) the estimation of $D_{1/2}$ and $D$ from sequences of finite length is affected by finite-size effects. In order to illustrate point (ii) directly, we perform a third control experiment in which we generate the two subsequences from the same probability distribution. In this case the experimentally obtained values of $D$ that are nonzero are due only to statistical fluctuations.
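The cursor experiment can be reproduced on a small scale with a few lines of code. The following sketch is an illustration, not the authors' implementation: it uses a reduced ensemble of 20 sequences (instead of 2000) to keep the run time short, a fixed seed chosen arbitrarily, and hypothetical function names. It computes the divergence at every cursor position with either the natural weights $n^{(j)}/N$ or equal weights 1/2 and reports where each ensemble average peaks.

```python
import math
import random

def entropy_bits(counts, n):
    """Entropy (bits) of the relative frequencies counts / n."""
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def scan_divergence(seq, alphabet, natural_weights=True):
    """Divergence between left and right part for cursor positions n1 = 1..N-1.

    natural_weights=True  uses weights n1/N and n2/N,
    natural_weights=False uses equal weights 1/2.
    """
    N = len(seq)
    index = {a: i for i, a in enumerate(alphabet)}
    total = [0] * len(alphabet)
    for s in seq:
        total[index[s]] += 1
    left = [0] * len(alphabet)
    values = []
    for n1 in range(1, N):
        left[index[seq[n1 - 1]]] += 1          # symbol that moved to the left part
        n2 = N - n1
        right = [t - l for t, l in zip(total, left)]
        w1, w2 = (n1 / N, n2 / N) if natural_weights else (0.5, 0.5)
        mixture = [w1 * l / n1 + w2 * r / n2 for l, r in zip(left, right)]
        h_mix = -sum(x * math.log2(x) for x in mixture if x > 0)
        values.append(h_mix - w1 * entropy_bits(left, n1) - w2 * entropy_bits(right, n2))
    return values

def random_binary(rng, length, p0):
    """i.i.d. binary sequence in which symbol 0 occurs with probability p0."""
    return [0 if rng.random() < p0 else 1 for _ in range(length)]

rng = random.Random(0)                 # arbitrary seed, for reproducibility only
ensemble, N, n = 20, 2500, 500         # reduced ensemble size (assumption)
avg_nat = [0.0] * (N - 1)
avg_eq = [0.0] * (N - 1)
for _ in range(ensemble):
    seq = random_binary(rng, n, 0.45) + random_binary(rng, N - n, 0.55)
    for i, d in enumerate(scan_divergence(seq, (0, 1), True)):
        avg_nat[i] += d / ensemble
    for i, d in enumerate(scan_divergence(seq, (0, 1), False)):
        avg_eq[i] += d / ensemble

print("cursor position maximizing <D>     :", 1 + avg_nat.index(max(avg_nat)))
print("cursor position maximizing <D_1/2> :", 1 + avg_eq.index(max(avg_eq)))
```

With such a small ensemble the averages are noisy, so the reported positions will scatter more than the curves in Fig. 1; increasing `ensemble` trades run time for smoother averages.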
Figure 1(c) shows $\langle D \rangle$ versus $n^{(1)}$ and $\langle D_{1/2} \rangle$ versus $n^{(1)}$ for an ensemble of 2000 stationary, binary sequences of length $N = 2500$ in which each symbol is generated with probability 0.5. We find that, for all positions $n^{(1)}$, the values of $\langle D \rangle$ are approximately the same, whereas the values of $\langle D_{1/2} \rangle$ depend dramatically on $n^{(1)}$. Figure 1(c) also shows that $\langle D \rangle$ is not identical to zero, and we devote the following three sections to derivations of approximations of the mean, the variance, and the probability distribution function of $D$.

A. Mean of D

In this section we will derive an analytical approximation of the mean value of $D$ when computed from an ensemble of finite i.i.d. sequences of length $N$. It follows directly from the Jensen inequality that the expected value, $\langle H[f] \rangle$, of the entropy computed from an ensemble of finite-length sequences cannot be greater than the theoretical value, $H[p]$, of the entropy computed from the unobservable probabilities [25], i.e.,

$$\langle H[f] \rangle \le H[p], \tag{19}$$

where $\langle \cdot \rangle$ denotes the expectation value over the ensemble of finite-length i.i.d. sequences generated by the probability distribution $p$. This mathematical statement is intuitively clear: due to the finite sample size, the relative frequency vector $f$ fluctuates from sample to sample around the probability vector $p$, and the majority of these fluctuations will make $f$ less uniform than $p$. Since the entropy $H[p]$ quantifies the uniformity of the probability distribution $p$, we expect that the majority of the values of $H[f]$ computed from an ensemble of fluctuating frequency vectors $f$ will be smaller than the value of $H[p]$.
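For a binary alphabet the expectation value of the plug-in entropy can be computed exactly by summing over the binomial distribution of the symbol count, which makes the inequality easy to inspect. The sketch below is illustrative; the function names and the choice $p = 0.45$ are assumptions made for this example.

```python
import math

def binary_entropy_bits(f):
    """Entropy (bits) of the binary distribution (f, 1 - f)."""
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return -(f * math.log2(f) + (1 - f) * math.log2(1 - f))

def expected_entropy_bits(N, p):
    """Exact <H[f]> for an i.i.d. binary sequence of length N, where the
    count F of the first symbol is binomially distributed."""
    total = 0.0
    for F in range(N + 1):
        prob = math.comb(N, F) * p**F * (1 - p) ** (N - F)
        total += prob * binary_entropy_bits(F / N)
    return total

p = 0.45
for N in (10, 100, 1000):
    # The expected plug-in entropy approaches H[p] from below as N grows
    print(N, expected_entropy_bits(N, p), binary_entropy_bits(p))
```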

Up to first order the expected value of $H[f]$ can be approximated by

$$\langle H[f] \rangle \approx H[p] - \frac{k-1}{2N \ln 2}, \tag{20}$$

where $k$ is the number of components of the probability and frequency vectors $p$ and $f$, $N$ is the sample size, and the symbol $\approx$ indicates that we neglect terms of the order of $O(1/N^2)$. By applying Eq. (20) to each of the subsequences we obtain

$$\langle H[f^{(j)}] \rangle \approx H[p^{(j)}] - \frac{k-1}{2 n^{(j)} \ln 2}, \tag{21}$$

for $j = 1, 2, \ldots, m$, where the symbol $\approx$ indicates that we neglect terms of the order of $O(1/(n^{(j)})^2)$. We will use approximations (20) and (21) to derive in the remainder of this section the expected value of the Jensen-Shannon divergence $D[f^{(1)}, f^{(2)}, \ldots, f^{(m)}]$ computed from an ensemble of i.i.d. sequences of total length $N$.

In order to avoid lengthy expressions, we define $D[F] \equiv D[f^{(1)}, f^{(2)}, \ldots, f^{(m)}]$ and $D[P] \equiv D[p^{(1)}, p^{(2)}, \ldots, p^{(m)}]$, and by substituting Eqs. (20) and (21) into Eq. (1) we obtain

$$\langle D[F] \rangle \approx D[P] + \frac{k-1}{2N \ln 2} \left( N \sum_{j=1}^{m} \frac{\pi^{(j)}}{n^{(j)}} - 1 \right). \tag{22}$$

This expression shows that, in general, the bias $\langle D[F] \rangle - D[P]$ depends on the lengths $n^{(j)}$ of the subsequences. It is easy to see that one choice of weights that makes Eq. (22) independent of the subsequence lengths $n^{(j)}$ is

$$\pi^{(j)} = n^{(j)}/N, \tag{23}$$

for $j = 1, 2, \ldots, m$. This finding is interesting because this particular choice of weights turns out to be identical to the natural choice of weights in all of the three interpretations of $D$ presented in Sec. III. With this choice of weights, the expected value of the Jensen-Shannon divergence $D$ becomes

$$\langle D[F] \rangle \approx D[P] + \frac{(k-1)(m-1)}{2N \ln 2}, \tag{24}$$

which is independent of the subsequence lengths $n^{(j)}$. Figure 2 illustrates the independence of the mean value of $D$ of the subsequence lengths $n^{(j)}$, and it also shows that Eq. (24) is a reasonable approximation of the mean value of $D$. Hence, expression (24) can be used as a reference to decide if a difference in composition between two sequences is larger than expected. Note that in Fig. 1(c) the average value of $D$ fits the value predicted by Eq. (24) for $k = 2$, $m = 2$, and $N = 2500$, namely, $\langle D \rangle \approx 1/(5000 \ln 2) \approx 2.9 \times 10^{-4}$ bits. In addition, from Eq. (24) we see that the bias of the quantity $ND$ is independent of the sequence length $N$, which allows us to compare Jensen-Shannon divergence values obtained from sequences of different sizes.

FIG. 2. Mean value of $D$ as a function of the total sequence length $N$, ranging from $N = 10$ to $N = 10^5$, averaged over an ensemble of 2000 i.i.d. sequences generated from a four-letter alphabet ($k = 4$), where each symbol occurs with probability 1/4. For each sequence length $N$ we choose three different cutting points $n^{(1)} = 0.5N$, $n^{(1)} = 0.6N$, and $n^{(1)} = 0.7N$, and we compute for each $N$, each $n^{(1)}$, and each of the 2000 i.i.d. sequences the Jensen-Shannon divergence $D$ between the composition of the left subsequence of length $n^{(1)}$ and the composition of the right subsequence of length $n^{(2)} = N - n^{(1)}$. For each $N$ and $n^{(1)}$ we compute the average of $D$ over the ensemble of all 2000 i.i.d. sequences, and the figure shows the ensemble average $\langle D \rangle$ as a function of $N$ and $n^{(1)}$. We find that $\log_{10} \langle D \rangle$ decays almost linearly as a function of $\log_{10} N$, with a slope very close to $-1$, for each of $n^{(1)} = 0.5N$ (circles), $n^{(1)} = 0.6N$ (triangles), and $n^{(1)} = 0.7N$ (diamonds), and we also find that the approximation of $\langle D \rangle$ from Eq. (24) (solid line) agrees very well with the simulation results.

With the naive choice of weights $\pi^{(j)} = 1/m$ we obtain for the expected value of the Jensen-Shannon divergence the approximation

$$\langle D_{1/m}[F] \rangle \approx D_{1/m}[P] + \frac{k-1}{2N \ln 2} \left( \frac{N}{A} - 1 \right), \tag{25}$$

where $A \equiv m \big/ \sum_{j=1}^{m} (1/n^{(j)})$ denotes the harmonic mean of the subsequence lengths $n^{(j)}$. Clearly $\langle D_{1/m} \rangle$ depends on the subsequence lengths $n^{(j)}$, and we see that $\langle D_{1/m} \rangle$ becomes minimal for $n^{(j)} = N/m$, while $\langle D_{1/m} \rangle$ diverges to infinity for $n^{(j)} \to 0$. This analytical approximation of the expected value of $D_{1/m}$ is consistent with the dramatic increase of the dashed line corresponding to $\langle D_{1/2} \rangle$ close to the edges $n^{(1)} \to 0$ or $n^{(2)} \to 0$ of the abscissa of Fig. 1.

There is another advantage of choosing the weights $\pi^{(j)}$ by Eq. (23).
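The first-order bias formulas are simple enough to evaluate by hand, but a short numerical sketch makes the contrast between the two weightings concrete. The code below is illustrative only: `jsd_bias` evaluates the general first-order bias for arbitrary weights, `jsd_bias_natural` the length-independent special case for weights $n^{(j)}/N$; both names are hypothetical.

```python
import math

def jsd_bias(weights, lengths, k):
    """First-order bias <D[F]> - D[P] in bits for arbitrary weights:
    (k - 1) / (2 N ln 2) * (N * sum_j w_j / n_j - 1)."""
    N = sum(lengths)
    correction = N * sum(w / n for w, n in zip(weights, lengths)) - 1
    return (k - 1) / (2 * N * math.log(2)) * correction

def jsd_bias_natural(N, k, m):
    """First-order bias for the natural weights n_j / N:
    (k - 1)(m - 1) / (2 N ln 2), independent of the subsequence lengths."""
    return (k - 1) * (m - 1) / (2 * N * math.log(2))

k, N = 2, 2500
for n1 in (10, 500, 1250):
    lengths = [n1, N - n1]
    natural = jsd_bias([n / N for n in lengths], lengths, k)
    equal = jsd_bias([0.5, 0.5], lengths, k)
    # natural stays constant, equal grows as the cut approaches an edge
    print(n1, natural, equal)
```

Only for the symmetric cut (both subsequences of equal length) do the two weightings give the same bias, because equal weights then coincide with the natural ones.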
We will show in the following section that the choice of the weights $\pi^{(j)} = n^{(j)}/N$ minimizes the quadratic deviation of the observed from the true Jensen-Shannon divergence. This advantage is more important than the advantage of having a bias that is independent of $n^{(j)}$, because the bias can be corrected analytically, in a first-order approximation, whereas the quadratic deviation of the observed from the true Jensen-Shannon divergence (i.e., the quadratic error) cannot be reduced after the fact. Hence, it is desirable to obtain an estimator of $D$ that minimizes the quadratic error, and we will show in the following section that the choice of the weights $\pi^{(j)} = n^{(j)}/N$ yields exactly that optimal estimator.

B. Variance of D

The variance of $D[F]$ is given by

$$\sigma^2(D[F]) = \sigma^2\Big(H[f] - \sum_{j=1}^{m} \pi^{(j)} H[f^{(j)}]\Big) = \sigma^2(H[f]) + \sum_{j=1}^{m} \big(\pi^{(j)}\big)^2 \sigma^2(H[f^{(j)}]) - 2\sum_{j=1}^{m} \pi^{(j)}\, \mathrm{cov}(H[f],H[f^{(j)}]) + 2\sum_{j=1}^{m}\sum_{l=j+1}^{m} \pi^{(j)}\pi^{(l)}\, \mathrm{cov}(H[f^{(j)}],H[f^{(l)}]). \tag{26}$$

As the set of vectors $f^{(1)}, f^{(2)}, \ldots, f^{(m)}$ is product-multinomially distributed, we obtain that $H[f^{(j)}]$ and $H[f^{(l)}]$ are statistically independent for any $j \neq l$. Hence, the terms $\mathrm{cov}(H[f^{(j)}],H[f^{(l)}])$ are all equal to zero, and we need to consider only the terms $\sigma^2(H[f])$, $\sigma^2(H[f^{(j)}])$, and $\mathrm{cov}(H[f],H[f^{(j)}])$. By Taylor-expanding $H[f]$ about $p$ we obtain a first-order approximation of the variance of $H[f]$ [5,6,27,28],

$$\sigma^2(H[f]) \approx \frac{1}{N}\,\sigma^2(\log_2 p), \tag{27}$$

where $N$ denotes the length of the whole sequence, $\sigma^2(\log_2 p)$ denotes the variance of the numbers $\log_2 p_i$ with respect to the probability distribution $p_i$, and the symbol $\approx$ indicates that we neglect terms of the order of $O(1/N^2)$. Likewise, we obtain a first-order approximation of the variance of $H[f^{(j)}]$,

$$\sigma^2(H[f^{(j)}]) \approx \frac{1}{n^{(j)}}\,\sigma^2(\log_2 p^{(j)}), \tag{28}$$

where $n^{(j)}$ denotes the length of subsequence $S^{(j)}$, $\sigma^2(\log_2 p^{(j)})$ denotes the variance of the numbers $\log_2 p_i^{(j)}$ with respect to the probability distribution $p_i^{(j)}$ for every $j = 1, 2, \ldots, m$, and the symbol $\approx$ indicates that we neglect terms of the order of $O(1/(n^{(j)})^2)$.

In the Appendix we derive a similar first-order approximation of the covariance terms, and under the null hypothesis that $p^{(1)} = p^{(2)} = \cdots = p^{(m)} = p$ we obtain

$$\mathrm{cov}(H[f],H[f^{(j)}]) \approx \frac{1}{N}\,\sigma^2(\log_2 p) \tag{29}$$

for all $j = 1, 2, \ldots, m$, where $\sigma^2(\log_2 p)$ denotes the variance of the numbers $\log_2 p_i$ with respect to the probability distribution $p_i$, and the symbol $\approx$ indicates that we neglect terms of the order of $O(1/N^2)$. It is interesting to note that the first-order approximation of the covariance between $H[f]$ and $H[f^{(j)}]$ [Eq. (29)] is equal to the first-order approximation of the variance of $H[f]$ [Eq. (27)].

By substituting the expressions from Eqs. (27), (28), and (29) into Eq. (26) we obtain for the variance of the Jensen-Shannon divergence with arbitrary weights $\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(m)}$,

$$\sigma^2(D) \approx \sum_{j=1}^{m} \pi^{(j)} \left( \frac{\pi^{(j)}}{n^{(j)}} - \frac{1}{N} \right) \sigma^2(\log_2 p), \tag{30}$$

under the null hypothesis that $p^{(1)} = p^{(2)} = \cdots = p^{(m)} = p$, where the symbol $\approx$ indicates that we neglect terms of the order of $O(1/N^2)$.

Let us now consider that choice of weights $\pi^{(j)}$ which minimizes the quadratic deviation of the observed from the true Jensen-Shannon divergence,

$$E\big[(D[F] - D[P])^2\big] = \sigma^2(D[F]) + \big(E[D[F]] - D[P]\big)^2. \tag{31}$$

As the second term on the right-hand side of Eq. (31) is of the order of $O(1/N^2)$, the minimization of the quadratic deviation of the observed from the true Jensen-Shannon divergence reduces to the minimization of the variance of the Jensen-Shannon divergence estimator. By using one Lagrange multiplier for the normalization constraint $\sum_j \pi^{(j)} = 1$ we obtain that the set of weights $\pi^{(j)} = n^{(j)}/N$ minimizes the variance of the Jensen-Shannon divergence $D$. This finding is intriguing, because this set of weights is (i) identical to the natural choice of weights in all three interpretations of $D$ presented in Sec. III as well as (ii) identical to the special choice of weights that makes the bias of $D$ independent of the subsequence lengths $n^{(j)}$ [Eq. (24)].

Furthermore, we find that for the special choice of weights $\pi^{(j)} = n^{(j)}/N$ the variance of $D$ vanishes in $O(1/N)$. This means that for this choice of weights the leading term of $\sigma^2(D)$ decreases with the sequence length $N$ as $1/N^2$, whereas in general it decreases as $1/N$. It is clear that for the special choice of weights $\pi^{(j)} = n^{(j)}/N$ the $O(1/N)$ term of $\sigma^2(D)$ becomes independent of both $n^{(j)}$ and $p$, and it is interesting that for this special choice of weights the $O(1/N^2)$ term of $\sigma^2(D)$ also turns out to be independent of both $n^{(j)}$ and $p$. In contrast, we find that for the naive choice of weights $\pi^{(j)} = 1/m$ the variance of $D^{1/m}$ neither vanishes in $O(1/N)$ nor becomes independent of the subsequence lengths $n^{(j)}$, and we obtain for the variance of the Jensen-Shannon divergence $D^{1/m}$,

$$\sigma^2(D^{1/m}) \approx \frac{\sigma^2(\log_2 p)}{N} \left( \frac{N}{mA} - 1 \right), \tag{32}$$

where $A \equiv m \big/ \sum_{j=1}^{m} 1/n^{(j)}$ denotes the harmonic mean of the subsequence lengths $n^{(j)}$.
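As a quick numerical illustration of Eqs. (30) and (32), the following sketch evaluates the weight-dependent factor that multiplies $\sigma^2(\log_2 p)$ in Eq. (30), using exact rational arithmetic. The function name, the example subsequence lengths, and the normalization $A = m / \sum_j 1/n^{(j)}$ of the harmonic mean are assumptions made for this illustration only.

```python
from fractions import Fraction

def variance_factor(weights, lengths):
    """Factor multiplying sigma^2(log2 p) in Eq. (30): sum_j pi_j * (pi_j / n_j - 1 / N)."""
    total = sum(lengths)
    return sum(w * (w / n - Fraction(1, total)) for w, n in zip(weights, lengths))

lengths = [200, 300, 500]                      # hypothetical subsequence lengths n_j
N, m = sum(lengths), len(lengths)
natural = [Fraction(n, N) for n in lengths]    # weights pi_j = n_j / N
uniform = [Fraction(1, m)] * m                 # naive weights pi_j = 1 / m
A = Fraction(m) / sum(Fraction(1, n) for n in lengths)  # harmonic mean (assumed normalization)

print(variance_factor(natural, lengths))   # O(1/N) factor for pi_j = n_j / N
print(variance_factor(uniform, lengths))   # O(1/N) factor for pi_j = 1 / m
print((N / (m * A) - 1) / N)               # Eq. (32) without the factor sigma^2(log2 p)
```

Under these assumptions the factor is exactly zero for the weights $n^{(j)}/N$, and for the weights $1/m$ it coincides with the factor in Eq. (32).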
Note that the expression inside the parentheses on the right-hand side of Eq. (32) is similar to the expression inside the parentheses on the right-hand side of Eq. (25). Hence, the variance of $D^{1/m}$ shows a singular

behavior similar to that of the mean of $D^{1/m}$ when the length of at least one subsequence becomes very small.

C. Probability distribution of D

Expression (24) provides a good criterion to tell whether an experimentally observed Jensen-Shannon divergence $D$ between frequency distributions is greater than expected by chance, but it does not tell if $D$ is significantly greater than expected by chance. In this section we derive the probability distribution of $D$ in order to quantify the statistical significance of experimentally observed values of $D$.

Given an observed value of $D = x$, we calculate the probability of obtaining this value or a lower value by chance under the null hypothesis that all sequences are generated from the same probability distribution. We call this probability the significance threshold of the given value $x$, and we denote it by

$$s(x) \equiv \mathrm{Prob}\{D \le x\}. \tag{33}$$

As $s(x)$ does not seem to admit an easy analytical expression, we obtain an approximation by using the Taylor expansion

$$x \log_2 \frac{x}{a} = \frac{x-a}{\ln 2} + \frac{(x-a)^2}{2a \ln 2} + O\big((x-a)^3\big) \tag{34}$$

to approximate $D$ in terms of quadratic functions as follows:

$$D = \sum_{j=1}^{m} \pi^{(j)} \sum_{i=1}^{k} p_i^{(j)} \log_2 \frac{p_i^{(j)}}{p_i} \approx \sum_{j=1}^{m} \sum_{i=1}^{k} \pi^{(j)} \frac{p_i^{(j)} - p_i}{\ln 2} + \sum_{j=1}^{m} \sum_{i=1}^{k} \pi^{(j)} \frac{\big(p_i^{(j)} - p_i\big)^2}{2 p_i \ln 2} \tag{35}$$

$$= \sum_{j=1}^{m} \sum_{i=1}^{k} \pi^{(j)} \frac{\big(p_i^{(j)} - p_i\big)^2}{2 p_i \ln 2}. \tag{36}$$

It is interesting to note that in this quadratic approximation of $D$ there are no constant or linear terms, because the first double sum of Eq. (35) vanishes exactly due to the normalization of the probability distributions $p_i^{(j)}$, $p_i$, and $\pi^{(j)}$. If we express the $\chi^2$ statistic [31] in the same notation, we obtain

$$\chi^2 = N \sum_{j=1}^{m} \sum_{i=1}^{k} \pi^{(j)} \frac{\big(p_i^{(j)} - p_i\big)^2}{p_i} \approx 2N(\ln 2)\, D. \tag{37}$$

The above $\chi^2$ statistic is known to converge for asymptotically large values of $N$ to the $\chi^2$ distribution with $\nu = (k-1)(m-1)$ degrees of freedom [31].
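The relation in Eq. (37) can be checked numerically on a small example. The sketch below is only an illustration: it applies the formulas to observed symbol counts with the weights $n^{(j)}/N$, and the function names and the count vectors are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of the frequencies defined by a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def js_divergence(count_vectors):
    """D with weights n_j / N, computed from one count vector per subsequence."""
    N = sum(sum(c) for c in count_vectors)
    pooled = [sum(col) for col in zip(*count_vectors)]
    return entropy(pooled) - sum(sum(c) / N * entropy(c) for c in count_vectors)

def chi_square(count_vectors):
    """Chi-square statistic in the notation of Eq. (37), with weights n_j / N."""
    N = sum(sum(c) for c in count_vectors)
    pooled = [sum(col) / N for col in zip(*count_vectors)]
    stat = 0.0
    for c in count_vectors:
        n = sum(c)
        stat += sum(n * (ci / n - fi) ** 2 / fi for ci, fi in zip(c, pooled))
    return stat

counts = [[260, 240, 255, 245], [240, 260, 245, 255]]  # hypothetical counts, k = 4, m = 2
N = sum(sum(c) for c in counts)
lhs = 2 * N * math.log(2) * js_divergence(counts)  # 2 N (ln 2) D
rhs = chi_square(counts)                           # chi-square statistic
print(lhs, rhs)
```

For compositions this close to each other the two quantities agree to well within one percent, as the quadratic approximation suggests.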
Hence, $2N(\ln 2)D$ also converges for asymptotically large values of $N$ to the $\chi^2$ distribution with $\nu = (k-1)(m-1)$ degrees of freedom, i.e., we obtain for asymptotically large values of $N$ the approximation

$$s(x) \approx F_\nu\big(2N(\ln 2)x\big) = \frac{\gamma\big(\nu/2,\, N(\ln 2)x\big)}{\Gamma(\nu/2)}, \tag{38}$$

where $F_\nu$ denotes the $\chi^2$ distribution function with $\nu$ degrees of freedom, and $\gamma(a,x)$ and $\Gamma(a)$ denote the incomplete and complete gamma functions, respectively [31,32]. The fact that $D$ can be interpreted as mutual information agrees with Eq. (38), as it is known that, up to a multiplicative constant, the mutual information converges for asymptotically large values of $N$ to the $\chi^2$ probability distribution with $(k-1)(m-1)$ degrees of freedom [6].

V. STATISTICAL PROPERTIES OF $D_{\max}$

Expression (38) gives the significance threshold of a single value of $D$ computed between two samples of fixed length. From the practical point of view this is equivalent to preselecting a fixed point that divides a sequence into two subsequences and asking for the probability that both subsequences have been generated from different probability distributions. In general, however, when facing an unknown sequence we do not have any a priori knowledge of the location of the possible cutting point.

The problem of finding the point where a nonstationary sequence can be most likely divided into two stationary subsequences has been widely studied in mathematics. There, the problem is known as the change-point problem [33-35], which consists of finding out (i) whether there exists a change point in the studied sequence, and (ii) at which position in the sequence the change point is located, provided it exists. Task (i) corresponds to determining whether the studied sequence is nonstationary, and task (ii) corresponds to determining the most likely location of the nonstationarity, provided it exists. Since $2N(\ln 2)D$ can be interpreted as the log-likelihood ratio of the model with change point and the model without change point, the maximization of $D$ along the sequence yields a natural way of determining the most likely location of the change point.
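The asymptotic significance threshold of Eq. (38) for a fixed cutting point can be sketched with the standard library alone. The series used for the incomplete gamma function is a textbook expansion chosen here for brevity (an assumption of this sketch, adequate for moderate arguments only), and the function names are hypothetical.

```python
import math

def regularized_gamma_p(a, x, max_terms=10000):
    """P(a, x) = gamma(a, x) / Gamma(a), summed from the power series of gamma(a, x)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)          # next series term x^n / (a (a+1) ... (a+n))
        total += term
        if term < 1e-17 * total:
            break
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))

def significance(x, N, k, m=2):
    """Eq. (38): s(x) for an observed divergence x, total length N, alphabet size k."""
    nu = (k - 1) * (m - 1)           # degrees of freedom
    return regularized_gamma_p(nu / 2.0, N * math.log(2.0) * x)

print(significance(0.005, 1000, 3))  # nu = 2, where F_nu(t) = 1 - exp(-t / 2)
```

For $\nu = 1$ and $\nu = 2$ the result can be compared against the closed forms $\mathrm{erf}\big(\sqrt{N(\ln 2)x}\big)$ and $1 - e^{-N(\ln 2)x}$, respectively.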
Hence, we move a cursor along the entire sequence, compute $D$ between the subsequences at both sides of the cursor for all positions, and choose as the optimal change point that position at which $D$ reaches its maximum value $D_{\max}$. In Sec. VI we describe a recursive segmentation algorithm that is based on this idea. The problem we address in this section is to decide if the value $D_{\max}$ of the Jensen-Shannon divergence at the optimal change point is sufficiently large to partition the sequence at that point, or if the value $D_{\max}$ is sufficiently small to consider the entire sequence as stationary and not partition it at all. Hence, we address in this section the problem of computing the statistical significance of experimentally observed values of $D_{\max}$.

Even if the studied sequence has been generated from a single probability distribution, we find $D_{\max} > 0$ due to statistical fluctuations. Moreover, we find that $D_{\max}$ increases above any significance threshold $s$ computed in Sec. IV as $N$ increases. To decide if the obtained value $D_{\max} = x$ is statistically significant we need to compute the probability of obtaining this value or a lower value by chance in a random sequence, i.e., we need to compute

$$s_{\max}(x) \equiv \mathrm{Prob}\{D_{\max} \le x\}. \tag{39}$$

Obviously $s_{\max}(x) \le s(x)$. In fact, if each value of $D$ at each position of the cursor were independent of the others, we would obtain [36]

$$s_{\max}(x) = [s(x)]^{N} = \big[F_\nu\big(2N(\ln 2)x\big)\big]^{N}, \tag{40}$$

where $N$ denotes the sequence length. Note that we are dealing with the comparison between only two distributions ($m = 2$), and hence the number of degrees of freedom is $\nu = k - 1$. It is clear that the random variables $D$ sampled at different positions of the same sequence are not statistically independent, because the value of $D$ at a given position is almost identical to the value of $D$ at the neighboring positions.

For binary ($k = 2$) i.i.d. sequences Horváth [37] derives an analytic expression for $s_{\max}(x)$ in the limit of asymptotically large sequence lengths $N$, and Csörgő and Horváth [38] generalize that result to arbitrary $k$ by deriving that the probability distribution function of $Z_N \equiv 2N(\ln 2)D_{\max}$ converges for asymptotically large values of $N$ to

$$\mathrm{Prob}\big\{A_N Z_N^{1/2} - B_N(\nu) \le x\big\} = \exp\big(-2e^{-x}\big), \tag{41}$$

where $N$ denotes the sequence length, $\nu \equiv k - 1$ denotes the number of degrees of freedom, $A_N$ is defined by

$$A_N \equiv \sqrt{2 \ln\ln N}, \tag{42}$$

and $B_N(\nu)$ is defined by

$$B_N(\nu) \equiv 2 \ln\ln N + \frac{\nu}{2} \ln\ln\ln N - \ln \Gamma(\nu/2). \tag{43}$$

By converting Eq. (41) into our notation we obtain

$$s_{\max}(x) \approx \exp\Big\{-2 \exp\Big[B_N(\nu) - A_N \sqrt{2N(\ln 2)x}\Big]\Big\}. \tag{44}$$

In the following paragraphs we test how accurately the asymptotic approximation $s_{\max}(x)$ agrees with the finite-size histogram $\hat{s}_{\max}(x)$ obtained by Monte Carlo simulations of sequences of length $N$ ranging from $10^2$ to $10^8$. For each sequence length $N = 10^2$, $10^4$, $10^6$, and $10^8$, we generate an ensemble of $10^5$ quaternary ($k = 4$) i.i.d. sequences of length $N$, and for each sequence of each ensemble we move a cursor along the sequence and compute at each position $15 \le n \le N - 15$ the Jensen-Shannon divergence $D$ [39]. We define $D_{\max}$ as the maximum of all values of $D$ computed from one sequence, and by collecting all values $D_{\max}$ of each ensemble of $10^5$ random i.i.d. sequences of length $N$ we obtain the histograms $\hat{s}_{\max}(x)$ for each $N$.
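The cursor scan that yields $D_{\max}$ for one sequence can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the weights are taken as $n/N$ and $(N-n)/N$, and the symbol counts to the left of the cursor are updated incrementally so that the scan costs $O(Nk)$ operations.

```python
import math
from collections import Counter

def entropy(counts, total):
    """Shannon entropy (in bits) of the frequencies counts / total."""
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def scan_d_max(seq, margin=15):
    """Return (D_max, n) over cursor positions margin <= n <= N - margin (margin >= 1).

    Returns (-1.0, None) if the sequence is too short to contain any position.
    """
    N = len(seq)
    total = Counter(seq)
    left = Counter(seq[:margin])          # counts of seq[:n], starting at n = margin
    h_total = entropy(total.values(), N)
    best_d, best_n = -1.0, None
    for n in range(margin, N - margin + 1):
        right = [total[s] - left[s] for s in total]   # counts of seq[n:]
        d = (h_total
             - n / N * entropy(left.values(), n)
             - (N - n) / N * entropy(right, N - n))
        if d > best_d:
            best_d, best_n = d, n
        left[seq[n]] += 1                 # move the cursor one symbol to the right
    return best_d, best_n

d_max, cut = scan_d_max("A" * 50 + "C" * 50)   # toy sequence with one obvious change point
print(d_max, cut)
```

Repeating the scan over an ensemble of random i.i.d. sequences and collecting the values of $D_{\max}$ gives the empirical distribution $\hat{s}_{\max}(x)$ described above.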
Figure 3(a) shows the histograms $\hat{s}_{\max}(x)$ for $k = 4$ and $N = 10^2$, $10^4$, $10^6$, and $10^8$ symbols together with the asymptotic approximations $s_{\max}(x)$ (solid lines). We find that the asymptotic approximations $s_{\max}(x)$ are not very accurate, and that even for sequence lengths as large as $N = 10^8$ there is still a significant deviation between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$. Figure 3(a) also shows that the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ are particularly large in the right tail, where we desire both distributions to agree particularly well. Figure 3(b) illustrates the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ by plotting $\hat{s}_{\max}(x) - s_{\max}(x)$ versus $2N(\ln 2)x$. We find that the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ tend to become smaller as the sequence length $N$ increases, but even for sequences of length $N = 10^8$ the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ are greater than [...].

FIG. 3. Histograms $\hat{s}_{\max}(x)$ of $x = 2N(\ln 2)D_{\max}$ and their asymptotic approximations $s_{\max}(x)$ obtained from ensembles of $10^5$ quaternary ($k = 4$) i.i.d. sequences of length $N = 10^2$, $10^4$, $10^6$, and $10^8$. (a) shows that the asymptotic approximations $s_{\max}(x)$ are not very accurate for finite-size sequences ranging in length $N$ from $10^2$ to $10^8$, and that the largest deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ occur in the right tails of the distributions. (b) shows a plot of the differences between the histograms $\hat{s}_{\max}(x)$ and their asymptotic approximations $s_{\max}(x)$ versus $x = 2N(\ln 2)D_{\max}$. We find that the accuracy of the approximations increases with increasing $N$, but that even for sequences of length $N = 10^8$ the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ are greater than [...].

As the asymptotic approximation $s_{\max}(x)$ is not very accurate for sequences ranging in length from $N = 10^2$ to $10^8$, we resort to Monte Carlo simulations to obtain numerical approximations of $\hat{s}_{\max}(x)$ as a function of the sequence length $N$ and the alphabet size $k$. We find that the functional form of $\hat{s}_{\max}(x)$ seems to be very similar to the functional form stated in Eq. (40) if we replace the sequence length $N$ by an effective length $N_{\rm eff}$, and if we introduce a scaling factor $\beta < 1$, by which we multiply the argument of $F_\nu$. Specifically, we find that the probability distribution of $D_{\max}$ may be approximated by

$$s_{\max}(x) = [s(\beta x)]^{N_{\rm eff}} = \big[F_\nu\big(2N(\ln 2)\beta x\big)\big]^{N_{\rm eff}}. \tag{45}$$

$N_{\rm eff}$ can be understood as the effective number of independent cutting points, and the scaling factor $\beta$ accounts for the reduction of the variance of $D_{\max}$ caused by correlations between the values of $D$ computed at different positions of the same sequence. Note that, in principle, both parameters $N_{\rm eff}$ and $\beta$ depend on both $N$ and $k$. To find an approximation of the dependence of $N_{\rm eff}$ and $\beta$ on $N$ and $k$, we perform the following simulations:

(1) We generate, for a given alphabet size $k$ and a given sequence length $N$, an ensemble of $10^5$ random i.i.d. sequences.
(2) For each sequence, we move a cursor along the sequence and compute at each position $15 \le n \le N - 15$ the Jensen-Shannon divergence $D$ [39], and we define $D_{\max}$ as the maximum of all values of $D$ computed from one sequence.
(3) For each ensemble of $10^5$ random i.i.d. sequences we obtain the histogram $\hat{s}_{\max}(x)$, and we fit the parameters $N_{\rm eff}$ and $\beta$ of $s_{\max}(x)$ given by expression (45) to $\hat{s}_{\max}(x)$ by minimizing the Kolmogorov-Smirnov distance $\max_x |\hat{s}_{\max}(x) - s_{\max}(x)|$.
(4) We repeat the above procedure for different values of $k$ and $N$.

Figure 4(a) shows the histograms $\hat{s}_{\max}(x)$ for $k = 4$ and $N = 10^2$, $10^4$, $10^6$, and $10^8$ symbols together with the finite-size approximation $s_{\max}(x)$ obtained by the above procedure. We find by visual inspection of Fig. 4(a) and by extensive analysis of the Kolmogorov-Smirnov distances between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ for $k$ varying from 2 to 12 and $N$ varying from $10^2$ to $10^8$ that $s_{\max}(x)$ from Eq. (45) provides a good approximation of $\hat{s}_{\max}(x)$. Figure 4(b) shows the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ by plotting $\hat{s}_{\max}(x) - s_{\max}(x)$ versus $2N(\ln 2)x$, and we find that the maximum deviation between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ stays below 0.02 for all of the cases we analyze, ranging from $k = 2$ to $k = 12$ and from $N = 10^2$ to $N = 10^8$.
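Once $N_{\rm eff}$ and $\beta$ are known, Eq. (45) is straightforward to evaluate. The sketch below is a minimal standard-library illustration: the function names are hypothetical, the $\chi^2$ distribution function is computed from a textbook series for the incomplete gamma function (suitable for moderate arguments only), and the parameter values in the example call are placeholders rather than fitted values.

```python
import math

def chi2_cdf(t, nu, max_terms=10000):
    """Chi-square distribution function F_nu(t) via the series of the incomplete gamma function."""
    a, x = nu / 2.0, t / 2.0
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < 1e-17 * total:
            break
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))

def s_max(x, N, k, n_eff, beta):
    """Eq. (45): [F_nu(2 N (ln 2) beta x)]^N_eff with nu = k - 1 (two subsequences)."""
    return chi2_cdf(2.0 * N * math.log(2.0) * beta * x, k - 1) ** n_eff

def accept_cut(d_max, N, k, n_eff, beta, s0=0.95):
    """True if the significance of d_max exceeds the chosen threshold s0."""
    return s_max(d_max, N, k, n_eff, beta) > s0

# Placeholder parameters (assumptions for illustration, not fitted values).
print(s_max(0.01, 1000, 3, 2, 0.5))
```

With $n_{\rm eff} = 1$ and $\beta = 1$ the function reduces to the single-cut approximation of Eq. (38) for $m = 2$.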
Moreover, we find that the maximum deviation between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ stays below 0.01 if we restrict the comparison of $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ to the right tails of the distributions, where we want the approximations to be particularly accurate.

Next, we study how the parameters $N_{\rm eff}$ and $\beta$ obtained by the fitting procedure described above depend on the alphabet size $k$ and the sequence length $N$. Figure 5 shows $N_{\rm eff}$ and $\beta$ versus $N$ for varying values of $k$. First, we find that $\beta$ is practically independent of $N$. Second, we find that for each $k$ the effective number of cutting points $N_{\rm eff}$ admits a good linear fit as a function of $\ln N$, i.e.,

$$N_{\rm eff} = a \ln N + b. \tag{46}$$

Both parameters $a$ and $b$ depend on the alphabet size $k$, and we present the least-squares values of $a$ and $b$ as a function of $k$ in Table I.

FIG. 4. Histograms $\hat{s}_{\max}(x)$ of $x = 2N(\ln 2)D_{\max}$ and their finite-size approximations $s_{\max}(x)$ obtained from ensembles of $10^5$ quaternary ($k = 4$) i.i.d. sequences of length $N = 10^2$, $10^4$, $10^6$, and $10^8$. (a) shows that the approximations $s_{\max}(x)$ are more accurate for sequences of length $N$ ranging from $10^2$ to $10^8$ than the asymptotic approximations $s_{\max}(x)$ presented in Fig. 3, and that the largest deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ do not occur in the right tails of the distributions, which we desire to approximate as accurately as possible. (b) shows a plot of the differences between the histograms $\hat{s}_{\max}(x)$ and their finite-size approximations $s_{\max}(x)$ versus $x = 2N(\ln 2)D_{\max}$. We find that the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ are smaller than 0.02. Moreover, we find that the deviations between $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ are smaller than 0.01 if we restrict the comparison of $\hat{s}_{\max}(x)$ and $s_{\max}(x)$ to the tails of the distributions, which we desire to approximate as accurately as possible.

VI. APPLICATIONS OF D

In this section we illustrate how the results obtained in the previous sections may be used to develop an algorithm that can partition a nonstationary sequence into stationary subsequences.
We describe this segmentation algorithm based on the Jensen-Shannon divergence $D$ in detail, and we present three application examples of this recursive segmentation algorithm.

Many sequence analysis techniques rely on the stationarity of the analyzed sequence, i.e., they rely on the assumption that all portions of the sequence have at least the same composition. This a priori assumption is very often in conflict with experimental data, as, for example, in the case of DNA sequences [40]. The algorithm described here, which is an improved version of the algorithm presented in Refs. [13] and [18], allows us to decompose a nonstationary sequence

into stationary subsequences of homogeneous composition as follows: First, we move along the sequence a cursor that divides the sequence at each position into two subsequences, and we compute $D$ for each position of the cursor. We select the point at which $D$ reaches its maximum value $D_{\max}$, and we compute its statistical significance $s_{\max}$. If this $s_{\max}$ exceeds a given threshold $s_0$, the sequence is cut at this point, and the procedure continues recursively for each of the two resulting subsequences. Otherwise, the sequence remains undivided. The process stops when none of the possible cutting points has a significance exceeding $s_0$, and we say that the sequence is segmented at significance threshold $s_0$. In the following three sections we present three examples that illustrate this recursive segmentation process.

FIG. 5. Parameter values of $N_{\rm eff}$ (squares) and $\beta$ (circles) as a function of the sequence length $N$, ranging from 200 to $10^5$, for an alphabet size $k = 4$. We find that $\beta$ is almost independent of $N$, $\beta \approx 0.80$, while $N_{\rm eff}$ admits a good linear fit to $\ln N$. The least-squares fit to $N_{\rm eff} = a \ln N + b$ yields $a = 2.44$ and $b = -6.15$.

TABLE I. Values of the parameters $a$, $b$, and $\beta$ obtained by least-squares fitting of $s_{\max}(x)$ for three values of the alphabet size $k$.

A. Segmentation of a model sequence with known compositional domains

In order to test if the segmentation algorithm works, we generate a binary sequence of length [...] obtained by joining patches of different length and composition. We choose the sizes of the patches randomly from a power-law distribution in order to obtain a wide range of different sizes, and we choose the composition of the patches randomly from a truncated Gaussian distribution centered at 1/2. To show graphically the variation in composition along this sequence, we plot in Fig. 6 the walk of the sequence. Given a binary sequence $\{y_i\}$, $i = 1, \ldots, N$, where $y_i$ can assume the values $+1$ or $-1$, the walk of the sequence at position $n$ is defined by [41]

$$w(n) \equiv \sum_{i=1}^{n} y_i. \tag{47}$$

Regions with a positive slope in Fig. 6 correspond to an abundance of $+1$'s, and regions with a negative slope correspond to an abundance of $-1$'s.

FIG. 6. Segmentation of a computer-generated binary sequence of length [...] obtained by joining patches of different length and composition. The solid line represents the walk of the sequence (see text) and the vertical dotted lines represent the locations of the cuts obtained by the recursive segmentation procedure at significance threshold $s_0 = 95\%$. We find that the recursive segmentation procedure is indeed capable of partitioning the nonstationary input sequence into stationary subsequences at those points (vertical dotted lines) at which the local composition of the sequence changes, indicated by changes of the slope of the sequence walk (solid line).

We apply the segmentation procedure presented above to this example sequence, and the vertical lines in Fig. 6 correspond to the cuts obtained by means of the segmentation procedure. Figure 6 shows clearly that the positions of the cuts coincide accurately with changes in the slope of $w(n)$. Moreover, regions without any cut do not seem to show a significant change of the slope of $w(n)$. This observation allows us to conjecture that the subsequences obtained by the segmentation procedure are indeed homogeneous at the considered significance threshold. It is worth mentioning that the method does not rely on any initial assumption about the size distribution of the subsequences, and as we can verify by inspecting Fig. 6 the resulting subsequences indeed have a great variety of sizes.

B. Length distribution of compositionally stationary domains in prokaryotic and eukaryotic DNA

In this subsection we present one example in which we apply the recursive segmentation procedure to DNA sequences with the goal of studying the length distribution of compositionally stationary domains in prokaryotic and eukaryotic DNA. We segment at a significance threshold of $s_0 = 95\%$ the complete genome of the bacterium Escherichia coli [42] with a length of [...] base pairs (bp) as well as the human major histocompatibility complex (MHC) region of chromosome 6 [43] with a similar size of [...] bp. In both cases we use the natural four-letter alphabet A