Limiting spectral distribution of $XX'$ matrices


Limiting spectral distribution of $XX'$ matrices

Arup Bose, Sreela Gangopadhyay, Arnab Sen

July 23, 2009

Abstract

The methods used to establish the limiting spectral distribution (LSD) of large dimensional random matrices include the well-known moment method, which invokes the trace formula. Its success has been demonstrated for several types of matrices, such as the Wigner matrix and the sample covariance matrix. In a recent article, Bryc, Dembo and Jiang (2006) [7] establish the LSD for random Toeplitz and Hankel matrices using the moment method. They perform the necessary counting of terms in the trace by splitting the relevant sets into equivalence classes and relating the limits of the counts to certain volume calculations. Bose and Sen (2008) [6] have developed this method further and have provided a general framework which deals with symmetric matrices with entries coming from an independent sequence.

In this article we enlarge the scope of the above approach to consider matrices of the form $A_p = \frac{1}{n}XX'$ where $X$ is a $p \times n$ matrix with real entries. We establish some general results on the existence of the spectral distribution of such matrices, appropriately centered and scaled, when $p \to \infty$, $n = n(p) \to \infty$ and $p/n \to y$ with $0 \le y < \infty$. As examples we show the existence of the spectral distribution when $X$ is taken to be the appropriate asymmetric Hankel, Toeplitz, circulant or reverse circulant matrix. In particular, when $y = 0$, the limits for all these matrices coincide and equal the limit for the symmetric Toeplitz matrix derived in Bryc, Dembo and Jiang (2006) [7]. In the other cases, we obtain new limiting spectral distributions for which no closed form expressions are known. We demonstrate the nature of these limits through some simulation results.

Keywords: Large dimensional random matrix, eigenvalues, sample covariance matrix, Toeplitz matrix, Hankel matrix, circulant matrix, reverse circulant matrix, spectral distribution, bounded Lipschitz metric, limiting spectral distribution, moment method, volume method, almost sure convergence, convergence in distribution.

AMS Subject Classification: 60F05, 60F15, 62E20, 60G57.

Arup Bose and Sreela Gangopadhyay: Stat Math Unit, Indian Statistical Institute, Kolkata. Arnab Sen: Dept. of Statistics, U.C. Berkeley.

1 Introduction: random matrices and spectral distribution

Random matrices were introduced in mathematical statistics by Wishart (1928) [20]. In 1950 Wigner successfully used random matrices to model the nuclear energy levels of a disordered particle system. The seminal work, Wigner (1958) [19], laid the foundation of this subject. This theory has now grown into a separate field, with applications in many branches of science, in areas as diverse as multivariate statistics, operator algebra, number theory, signal processing and wireless communication. The spectral properties of random matrices form, in general, a rich and attractive area for mathematicians. In particular, when the dimension of the random matrix is large, the problem of studying the behavior of the eigenvalues in the bulk of the spectrum arises naturally in many areas of science and has received considerable attention.

Suppose $A_p$ is a $p \times p$ random matrix. Let $\lambda_1, \ldots, \lambda_p \in \mathbb{R}$ (or $\mathbb{C}$) denote its eigenvalues. When the eigenvalues are real, our convention will be to always write them in ascending order. The empirical spectral measure $\mu_p$ of $A_p$ is the random measure on $\mathbb{R}$ (or $\mathbb{C}$) given by

$$\mu_p = \frac{1}{p}\sum_{i=1}^{p} \delta_{\lambda_i}, \qquad (1)$$

where $\delta_x$ is the Dirac delta measure at $x$. The random probability distribution function on $\mathbb{R}$ (or $\mathbb{C}$) corresponding to $\mu_p$ is known as the Empirical Spectral Distribution (ESD) of $A_p$. We will denote it by $F^{A_p}$. If $\{F^{A_p}\}$ converges weakly (as $p$ tends to infinity), either almost surely or in probability, to some (nonrandom) probability distribution, then that distribution is called the Limiting Spectral Distribution (LSD) of $\{A_p\}$.

Proving the existence of the LSD and deriving its properties for general patterned matrices has drawn significant attention in the literature. We refer to Bai (1999) [1] and Bose and Sen (2008) [6] for information on several interesting situations where the LSD exists and can be explicitly specified. Possibly the two most significant matrices whose LSDs have been extensively studied are the Wigner and the sample covariance matrices. For simplicity, let us assume for the time being that all random sequences under consideration are independent and uniformly bounded.

1. Wigner matrix. In its simplest form, the Wigner matrix $W_n^{(s)}$ (Wigner, 1955, 1958) [18, 19], of order $n$, is an $n \times n$ symmetric matrix whose entries on and above the diagonal are i.i.d. random variables with zero mean and variance one. Denoting those i.i.d. random variables by $\{x_{ij} : 1 \le i \le j\}$, we can visualize the Wigner matrix as

$$W_n^{(s)} = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1(n-1)} & x_{1n} \\ x_{12} & x_{22} & x_{23} & \cdots & x_{2(n-1)} & x_{2n} \\ \vdots & & & & & \vdots \\ x_{1n} & x_{2n} & x_{3n} & \cdots & x_{n(n-1)} & x_{nn} \end{pmatrix}. \qquad (2)$$

It is well known that almost surely,

the LSD of $n^{-1/2} W_n^{(s)}$ is the semicircle law $W$, (3)

with the density function

$$p_W(s) = \begin{cases} \frac{1}{2\pi}\sqrt{4 - s^2} & \text{if } |s| \le 2, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$
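The semicircle limit (3)-(4) is easy to visualize numerically. Below is a minimal simulation sketch (ours, not from the paper; it assumes standard normal inputs) comparing the ESD of $n^{-1/2}W_n^{(s)}$ with the density in (4).

```python
import numpy as np
import matplotlib.pyplot as plt

def wigner_esd(n, rng):
    """Eigenvalues of n^{-1/2} W_n^(s), i.i.d. N(0,1) entries on and above the diagonal."""
    g = rng.standard_normal((n, n))
    w = np.triu(g) + np.triu(g, 1).T          # symmetrize the upper triangle
    return np.linalg.eigvalsh(w) / np.sqrt(n)

rng = np.random.default_rng(0)
lam = wigner_esd(2000, rng)
s = np.linspace(-2, 2, 400)
plt.hist(lam, bins=60, density=True, alpha=0.5, label="ESD, n = 2000")
plt.plot(s, np.sqrt(4 - s**2) / (2 * np.pi), label="semicircle density (4)")
plt.legend()
plt.show()
```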

2. Sample covariance matrix ($S$ matrix). Suppose $\{x_{jk} : j, k = 1, 2, \ldots\}$ is a double array of i.i.d. real random variables with mean zero and variance $1$. The matrix

$$A_p(W) = \frac{1}{n} W_p W_p' \quad \text{where } W_p = ((x_{ij}))_{1 \le i \le p,\ 1 \le j \le n} \qquad (5)$$

is called a sample covariance matrix (in short, an $S$ matrix). Let $I_p$ denote the identity matrix of order $p$. The following results are well known. See Bai (1999) [1] or Bose and Sen (2008) [6] for moment method and Stieltjes transform method proofs. For a historical account of the successive development of these results, see Bai and Yin (1988) [3] for (a), and Marčenko and Pastur (1967) [13], Grenander and Silverstein (1977) [9], Wachter (1978) [17], Jonsson (1982) [11], Yin and Krishnaiah (1985) [22] and Yin (1986) [21] for (b).

(a) If $p \to \infty$ and $p/n \to 0$ then almost surely,

the LSD of $\sqrt{\frac{n}{p}}\,(A_p(W) - I_p)$ is the semicircle law given in (4) above. (6)

(b) If $p \to \infty$ and $p/n \to y \in (0, \infty)$ then almost surely,

the LSD of $A_p(W)$ is the Marčenko-Pastur law $MP_y$ given in (8) below. (7)

The Marčenko-Pastur law $MP_y$ has a positive mass $1 - \frac{1}{y}$ at the origin if $y > 1$. Elsewhere it has the density

$$MP_y(x) = \begin{cases} \frac{1}{2\pi x y}\sqrt{(b - x)(x - a)} & \text{if } a \le x \le b, \\ 0 & \text{otherwise,} \end{cases} \qquad (8)$$

where $a = a(y) = (1 - \sqrt{y})^2$ and $b = b(y) = (1 + \sqrt{y})^2$.

Suppose $p = n$. Then $W_n$ above is a square matrix with i.i.d. elements, and the Wigner matrix $W_n^{(s)}$ is obtained from $W_n$ by retaining the elements above and on the diagonal of $W_n$ and letting symmetry dictate the choice of the rest of the elements of $W_n^{(s)}$. We loosely say that $W_n$ is the asymmetric version of $W_n^{(s)}$. It is interesting to note the following:

(i) The LSD of $n^{-1/2} W_n^{(s)}$ in (3) and that of $\sqrt{\frac{n}{p}}\big(\frac{1}{n} W_p W_p' - I_p\big)$ in (6) are identical.

(ii) If $W$ and $M$ are random variables obeying, respectively, the semicircle law and the Marčenko-Pastur law with $y = 1$, then $M \stackrel{D}{=} W^2$ in law. That is, the LSD of $n^{-1/2} W_n^{(s)}$ and that of $\frac{1}{n} W_p W_p'$ bear this squaring relationship when $p/n \to 1$.
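The limit results (6)-(7) can likewise be checked by simulation. The sketch below (ours; normal inputs assumed) draws the ESD of $A_p(W)$ for $y = 1/2$ against the density (8).

```python
import numpy as np
import matplotlib.pyplot as plt

def mp_density(x, y):
    """Marcenko-Pastur density (8) on [a, b]; valid for 0 < y <= 1 (no mass at 0)."""
    a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    out = np.zeros_like(x)
    inside = (x >= a) & (x <= b)
    out[inside] = np.sqrt((b - x[inside]) * (x[inside] - a)) / (2 * np.pi * x[inside] * y)
    return out

rng = np.random.default_rng(0)
p, n = 1000, 2000                             # aspect ratio y = p/n = 1/2
w = rng.standard_normal((p, n))
lam = np.linalg.eigvalsh(w @ w.T / n)         # eigenvalues of A_p(W) = n^{-1} W_p W_p'
x = np.linspace(0.01, 3.5, 500)
plt.hist(lam, bins=60, density=True, alpha=0.5, label="ESD of A_p(W)")
plt.plot(x, mp_density(x, p / n), label="MP_y density (8)")
plt.legend()
plt.show()
```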

1.2 The problem and its motivation

In view of the above discussion, it is thus natural to study the LSD of matrices of the form $A_p(X) = \frac{1}{n} X_p X_p'$ where $X_p$ is a $p \times n$ suitably patterned (asymmetric) random matrix. Asymmetry is used loosely here: it just means that $X_p$ is not necessarily symmetric. In particular, one may ask the following questions.

(i) Suppose $p/n \to y$, $0 < y < \infty$. When does the LSD of $A_p(X) = \frac{1}{n} X_p X_p'$ exist?

(ii) Suppose $p/n \to 0$. When does the LSD of $\sqrt{\frac{n}{p}}\,(A_p(X) - I_p)$ exist? Note that, unlike for the $S$ matrix, the mean of $A_p(X)$ is not in general equal to $I_p$, and hence other centerings/normalisations may also need to be allowed.

(iii) Suppose $X_p$ is a $p \times n$ (asymmetric) patterned matrix and $A_p(X) = \frac{1}{n} X_p X_p'$ for which the limit in (ii) holds. Call it $A$. Now consider the square symmetric matrix $X_n^{(s)}$ obtained from $X_p$ by letting $p = n$ and symmetrising it by retaining the elements of $X_p$ above and on the diagonal and allowing symmetry to dictate the rest of the elements. Suppose, as $n \to \infty$, the LSD of $n^{-1/2} X_n^{(s)}$ exists. Call it $B$. In general, is there any relation between $B$ and $A$?

A further motivation to study the LSD of $\frac{1}{n} X_p X_p'$ comes from wireless communications theory. Many results on the information-theoretic limits of various wireless communication channels make substantial use of asymptotic random matrix theory, as the size of the matrix increases. An extended survey of results and works in this area may be found in Tulino and Verdu (2004) [16]. Also see Silverstein and Tulino (2006) [15] for a review of some of the existing mathematical results that are relevant to the analysis of the properties of random matrices arising in wireless communications.

The study of the asymptotic distribution of the singular values is seen as an important aspect in the analysis and design of wireless communication channels. A typical wireless communication channel may be described by the linear vector memoryless channel $y = X_p \theta + \epsilon$, where $\theta$ is the $n$-dimensional vector of the signal input, $y$ is the $p$-dimensional vector of the signal output, and the $p$-dimensional vector $\epsilon$ is the additive noise. $X_p$, in turn, is a $p \times n$ random matrix, generally with complex entries. Silverstein and Tulino (2006) [15] emphasize the asymptotic distribution of the squared singular values of $X_p$ under various assumptions on the joint distribution of the random matrix coefficients, where $n$ and $p$ tend to infinity while the aspect ratio $p/n \to y$, $0 < y < \infty$. In their model, generally speaking, the channel matrix $X_p$ can be viewed as $X_p = f(A_1, A_2, \ldots, A_k)$ where the $\{A_i\}$ are some independent random (rectangular) matrices with complex entries, each having its own meaning in terms of the channel. In most of the cases they studied, some $A_i$ have all i.i.d. entries, while the LSDs of the other matrices $A_j A_j'$, $j \ne i$, are assumed to exist. Then the LSD of $X_p X_p'$ is computed in terms of the LSDs of the $A_j A_j'$, $j \ne i$.

Specifically, certain CDMA channels can be modeled (see, e.g., Tulino and Verdu (2004) [16, Chapter 3] and Silverstein and Tulino (2006) [15]) as $X_p = CSA$ where $C$ is a $p \times p$ random symmetric Toeplitz matrix, $S$ is a matrix with i.i.d. complex entries independent of $C$, and $A$ is an $n \times n$ deterministic diagonal matrix. One of the main theorems in Silverstein and Tulino (2006) [15] establishes the LSD of $X_p = CSA$ for a random $p \times p$ matrix $C$, not necessarily Toeplitz, under the added assumption that the LSD of the matrices $CC'$ exists. When $C$ is a random symmetric Toeplitz matrix, the existence of the LSD of $CC'$ is immediate from the recent work of Bryc, Dembo and Jiang (2006) [7] and Hammond and Miller (2005) [10]. This also motivates us to study the LSD of $X_p X_p'$ matrices in some generality, where $X_p$ is not necessarily square and symmetric.

1.3 The moment method

We shall use the method of moments to study this convergence. To briefly describe this method, suppose $\{Y_p\}$ is a sequence of random variables with distribution functions $\{F_p\}$ such that $E(Y_p^h) \to \beta_h$ for every positive integer $h$, and $\{\beta_h\}$ satisfies Carleman's condition (see Feller, 1966, page 224) [8]:

$$\sum_{h=1}^{\infty} \beta_{2h}^{-1/2h} = \infty. \qquad (9)$$

Then there exists a distribution function $F$ such that for all $h$,

$$\beta_h = \beta_h(F) = \int x^h \, dF(x) \qquad (10)$$

and $\{Y_p\}$ (or equivalently $\{F_p\}$) converges to $F$ in distribution. We will often write $\beta_h$ in short when the underlying distribution $F$ is clear from the context.

Now suppose $\{A_p\}$ is a sequence of $p \times p$ symmetric matrices (with possibly random entries), and let, by a slight abuse of notation, $\beta_h(A_p)$ denote the $h$-th moment of the ESD of $A_p$. Suppose there is a sequence of nonrandom $\{\beta_h\}_{h=1}^{\infty}$ satisfying Carleman's condition such that

(M1) First moment condition: for every $h$, $E[\beta_h(A_p)] \to \beta_h$, and

(M2) Second moment condition: for every $h$, $\mathrm{Var}[\beta_h(A_p)] \to 0$.

Then the LSD is identified by $\{\beta_h\}_{h=1}^{\infty}$ and the convergence to the LSD holds in probability. This convergence may be strengthened to almost sure convergence by strengthening (M2), for example by replacing the variance by the fourth moment and showing that

(M4) Fourth moment condition: for every $h$, $E[\beta_h(A_p) - E(\beta_h(A_p))]^4 = O(p^{-2})$.

For any symmetric $p \times p$ matrix $A$ with eigenvalues $\lambda_1, \ldots, \lambda_p$, the trace formula

$$\beta_h(A) = \frac{1}{p}\sum_{i=1}^{p} \lambda_i^h = \frac{1}{p}\,\mathrm{Tr}(A^h)$$

is invoked to compute moments. Moment method proofs are known for the results on the Wigner matrix and the sample covariance matrix given earlier. See, for example, Bose and Sen (2008) [6].
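In simulations, (M1) and (M2) can be monitored directly through the trace formula. A small sketch (ours; it uses the $S$ matrix of the previous section, whose limit moments are those of $MP_y$, e.g. $\beta_2 \to 1 + y$ and $\beta_3 \to 1 + 3y + y^2$):

```python
import numpy as np

def beta_h(a, h):
    """h-th moment of the ESD of a symmetric matrix a: beta_h(a) = p^{-1} Tr(a^h)."""
    return np.trace(np.linalg.matrix_power(a, h)) / a.shape[0]

rng = np.random.default_rng(1)
p, n = 400, 800                      # y = p/n = 1/2
reps = []
for _ in range(20):                  # replications, to watch Var[beta_h(A_p)] shrink
    x = rng.standard_normal((p, n))
    reps.append([beta_h(x @ x.T / n, h) for h in (1, 2, 3)])
reps = np.array(reps)
print("E beta_h approx:", reps.mean(axis=0))   # (M1): near 1, 1.5, 2.75 for y = 1/2
print("Var beta_h approx:", reps.var(axis=0))  # (M2): small, shrinking with p
```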

1.4 Brief overview of our results

Let $\mathbb{Z}$ be the set of all integers. Let $\{x_\alpha\}_{\alpha \in \mathcal{Z}}$ be an input sequence, where $\mathcal{Z} = \mathbb{Z}$ or $\mathbb{Z}^2$ and the $\{x_\alpha\}$ are independent with $E x_\alpha = 0$ and $E x_\alpha^2 = 1$. Let $L_p : \{1, 2, \ldots, p\} \times \{1, 2, \ldots, n = n(p)\} \to \mathcal{Z}$ be a sequence of functions, which we call link functions. For simplicity of notation, we will write $L$ for $L_p$. We shall later impose suitable restrictions on the link functions $L_p$ and sufficient probabilistic assumptions on the variables. Consider the matrices

$$X_p = ((x_{L_p(i,j)}))_{1 \le i \le p,\ 1 \le j \le n} \quad \text{and} \quad A_p = A_p(X) = \frac{1}{n} X_p X_p'.$$

The function $L_p$ defines an appropriate pattern, and we may call these matrices patterned matrices. The assumptions on the link function restrict the number of times, and the manner in which, any element of the input sequence may occur in the matrix. We shall consider two different regimes:

Regime I. $p \to \infty$, $p/n \to y \in (0, \infty)$. In this case we consider the spectral distribution of $A_p = A_p(X) = \frac{1}{n} X_p X_p'$.

Regime II. $p \to \infty$, $p/n \to 0$. In this case we consider the spectral distribution of $(np)^{-1/2}(X_p X_p' - n I_p) = \sqrt{\frac{n}{p}}\,(A_p - I_p)$, where $I_p$ is the identity matrix of order $p$.

We will use the following assumptions on the input sequences in the two regimes.

Assumption R1. In Regime I, $\{x_\alpha\}$ are independent with mean zero and variance one, and are either uniformly bounded or identically distributed.

Assumption R2. In Regime II, $\{x_\alpha\}$ are independent with mean zero and variance $1$. Further, $\lambda$ is such that $p = O(n^{1/\lambda})$ and $\sup_\alpha E\big(|x_\alpha|^{4(1+1/\lambda)+\delta}\big) < \infty$ for some $\delta > 0$.

The moment method requires all moments to be finite. However, in Regime I, our basic assumption will be that the input sequence has finite second moment. To deal with this, we first prove a truncation result which shows that in Regime I, under an appropriate assumption on the link function, we may without loss of generality assume the input sequence to be uniformly bounded. The situation in Regime II is significantly more involved, and there we assume the existence of higher moments. We then establish a negligibility result which implies that for a nonzero contribution to the limiting moments, only the summands in the trace formula which are pair matched matter. See the next section for details on matching. The existence of the limit of the sum of pair matched terms, and then the computation of the limit, establishes the LSD. Quite interestingly, under reasonable assumptions on the allowed pattern, if the limits of the empirical moments exist, they automatically satisfy Carleman's condition and this ensures the existence of the LSD. However, we are unaware of any general existing method of imposing suitable restrictions on the link function to guarantee the existence of the limits of the moments.

As examples of our method, we let $X_p$ be the nonsymmetric Toeplitz, Hankel, reverse circulant and circulant matrices. We show that the LSD exists in both Regimes I and II, thereby answering questions (i) and (ii) of the previous section affirmatively for these matrices. The LSDs in Regime I are all new. However, in Regime II, the LSDs of all four matrices above are identical to the LSD obtained by Bryc, Dembo and Jiang (2006) [7] for the symmetric Toeplitz matrix $T_n^{(s)}$. This implies that the answer to question (iii) is not straightforward and needs further investigation. Closed form expressions for the LSD or for its moments do not seem to be easily obtainable. We provide a few simulations to demonstrate the nature of the LSD, and it is an interesting problem to derive detailed properties of the LSD in any of these cases.

1.5 Examples

In this section we consider some specific nonsymmetric matrices of interest. In particular, let $X_p$ be the asymmetric version of any of the four matrices Toeplitz, Hankel, circulant and reverse circulant. Then it turns out that in Regime II, the LSD of $\sqrt{\frac{n}{p}}\,(A_p(X_p) - I_p)$ exists and is identical to the LSD of $n^{-1/2} T_n^{(s)}$ obtained by Bryc, Dembo and Jiang (2006) [7] for the symmetric $n \times n$ Toeplitz matrix $T_n^{(s)}$. In Regime I, the LSDs of $A_p(X_p)$ exist, but they are all different. The proofs of the results given below are presented in Section 5.

1.5.1 The asymmetric Toeplitz matrix

The $n \times n$ symmetric Toeplitz matrix is defined as $T_n^{(s)} = ((x_{|i-j|}))$. It is known that if the input sequence $\{x_i\}$ has mean zero and variance one and is either independent and uniformly bounded, or is i.i.d., then the ESD of $n^{-1/2} T_n^{(s)}$ converges weakly almost surely to a nonrandom limit, $L_T$ (say).

See, for example, Bose and Sen (2008) [6], Bryc, Dembo and Jiang (2006) [7] and Hammond and Miller (2005) [10]. It may be noted that the moments of $L_T$ may be represented as volumes of certain subsets of the unit hypercubes, but they are not known in any explicit form.

Let $T_p = T = ((x_{i-j}))_{p \times n}$ and $A_p(T) = \frac{1}{n} T_p T_p'$. Note that $T_p$ is the nonsymmetric Toeplitz matrix.

Theorem 1. (i) [Regime I] Assume R1 holds. If $p/n \to y \in (0, \infty)$ then the empirical spectral distribution $F^{A_p(T)}$ converges in distribution almost surely to a nonrandom distribution which does not depend on the distribution of $\{x_i\}$.

(ii) [Regime II] Assume R2 holds. Then $F^{\sqrt{n/p}\,(A_p(T) - I_p)}$ converges in distribution to $L_T$ almost surely.
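Theorem 1 gives no closed form for the Regime I limit, but the limit is easy to observe by simulation. A minimal sketch (ours; `scipy.linalg.toeplitz` builds $T_p$ from its first column $x_0, x_1, \ldots$ and first row $x_0, x_{-1}, \ldots$):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
p, n = 800, 1600                                   # Regime I with y = 1/2
col = rng.standard_normal(p)                       # x_0, x_1, ..., x_{p-1}
row = np.concatenate(([col[0]], rng.standard_normal(n - 1)))  # x_0, x_{-1}, ...
T = toeplitz(col, row)                             # T_p = ((x_{i-j})), p x n
lam = np.linalg.eigvalsh(T @ T.T / n)              # spectrum of A_p(T) = n^{-1} T_p T_p'
# a histogram of lam approximates the (nonrandom) LSD of Theorem 1(i)
```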

1.5.2 The asymmetric Hankel matrix

Suppose $\{x_i : i = 0, \pm 1, \pm 2, \ldots\}$ is an independent input sequence. Let $H_n^{(s)} = ((x_{i+j}))$ be the $n \times n$ symmetric Hankel matrix. It is known that if $\{x_i\}$ has mean zero and variance one and is either uniformly bounded or i.i.d., then $F^{n^{-1/2} H_n^{(s)}}$ converges weakly almost surely. See, for example, Bryc, Dembo and Jiang (2006) [7] and Bose and Sen (2008) [6].

Now let $H = H_{p \times n}$ be the asymmetric Hankel matrix whose $(i,j)$-th entry is $x_{i+j}$ if $i > j$ and $x_{-(i+j)}$ if $i \le j$. Let $H_p^{(s)} = ((x_{i+j}))_{p \times n}$ be the rectangular Hankel matrix with the symmetric link function. Let

$$A_p(H) = \frac{1}{n} H_p H_p' \quad \text{and} \quad A_p(H^{(s)}) = \frac{1}{n} H_p^{(s)} H_p^{(s)\prime}.$$

We then have the following theorem.

Theorem 2. (i) [Regime I] Assume R1 holds. If $p/n \to y \in (0, \infty)$ then $F^{A_p(H)}$ converges almost surely to a nonrandom distribution which does not depend on the distribution of $\{x_i\}$.

(ii) [Regime II] Assume R2 holds. Then $F^{\sqrt{n/p}\,(A_p(H) - I_p)}$ converges almost surely to $L_T$. The same limit continues to hold if $A_p(H)$ is replaced by $A_p(H^{(s)})$.

1.5.3 The asymmetric reverse circulant matrix

The symmetric reverse circulant $R_n^{(s)}$ has the link function $L(i,j) = (i+j) \bmod n$. The LSD of $n^{-1/2} R_n^{(s)}$ has been discussed in Bose and Mitra (2003) [5] and Bose and Sen (2008) [6]. In particular, it is known that if $\{x_i\}$ are independent with mean zero and variance $1$ and are either (i) uniformly bounded or (ii) identically distributed, then the LSD $R$ of $n^{-1/2} R_n^{(s)}$ exists almost surely and has the density

$$f_R(x) = |x| \exp(-x^2), \quad -\infty < x < \infty,$$

with moments $\beta_{2h+1}(R) = 0$ and $\beta_{2h}(R) = h!$ for all $h \ge 0$.

Let $R_p^{(s)}$ be the $p \times n$ matrix with link function $L(i,j) = (i+j) \bmod n$, and let $R_p = R_{p \times n}$ be the asymmetric version of $R_p^{(s)}$ with the link function

$$L(i,j) = \begin{cases} (i+j) \bmod n & \text{for } i \le j, \\ -[(i+j) \bmod n] & \text{for } i > j. \end{cases}$$

So, in effect, the rows of $R_p$ are the first $p$ rows of an (asymmetric) reverse circulant matrix. Let

$$A_p(R) = \frac{1}{n} R_p R_p' \quad \text{and} \quad A_p(R^{(s)}) = \frac{1}{n} R_p^{(s)} R_p^{(s)\prime}. \qquad (11)$$

We then have the following theorem.

Theorem 3. (i) [Regime I] Assume R1 holds. If $p/n \to y \in (0, \infty)$ then $F^{A_p(R)}$ converges almost surely to a nonrandom distribution which does not depend on the distribution of $\{x_i\}$.

(ii) [Regime II] Assume R2 holds. Then $F^{\sqrt{n/p}\,(A_p(R) - I_p)}$ converges almost surely to $L_T$. The same limit continues to hold if $A_p(R)$ is replaced by $A_p(R^{(s)})$.

1.5.4 The asymmetric circulant matrix

The square circulant matrix is well known in the literature. Its eigenvalues can be explicitly computed and are closely related to the periodogram. Its LSD is the bivariate normal distribution. See, for example, Bose and Mitra [5]. Its symmetric version is treated in Bose and Sen (2008) [6]. Let

$$C_p = C_{p \times n} = ((x_{L(i,j)})) \quad \text{where } L(i,j) = (n - i + j) \bmod n, \quad \text{and} \quad A_p(C) = \frac{1}{n} C_p C_p'.$$

Theorem 4. (i) [Regime I] Assume R1 holds. If $p/n \to y \in (0, \infty)$ then $F^{A_p(C)}$ converges almost surely to a nonrandom distribution which does not depend on the distribution of $\{x_i\}$.

(ii) [Regime II] Assume R2 holds. Then $F^{\sqrt{n/p}\,(A_p(C) - I_p)}$ converges almost surely to $L_T$.

2 Basic notation, definitions and assumptions

Examples of link functions that correspond to matrices of interest are as follows:

0. Covariance matrix: $L_p(i,j) = (i,j)$.

1. Asymmetric Toeplitz matrix: $L_p(i,j) = i - j$.

2. Asymmetric Hankel matrix: $L_p(i,j) = \mathrm{sgn}(i-j)\,(i+j)$, where

$$\mathrm{sgn}(l) = \begin{cases} 1 & \text{if } l \ge 0, \\ -1 & \text{if } l < 0. \end{cases} \qquad (12)$$

3. Asymmetric reverse circulant: $L_p(i,j) = \mathrm{sgn}(i-j)\,[(i+j) \bmod n]$.

4. Asymmetric circulant: $L_p(i,j) = (n + j - i) \bmod n$.
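These link functions translate directly into code. The helper below (ours, purely illustrative) builds the patterned matrix $X_p = ((x_{L_p(i,j)}))$ for any of the scalar-valued links, drawing one independent standard normal per distinct $L$-value; it is slow, but it makes the "one input per $L$-value" structure explicit.

```python
import numpy as np

# Scalar link functions (1-based i, j as in the paper); sgn(0) = 1 as in (12).
def L_toeplitz(i, j, n):  return i - j
def L_hankel(i, j, n):    return (i + j) if i >= j else -(i + j)
def L_revcirc(i, j, n):   return ((i + j) % n) if i >= j else -((i + j) % n)
def L_circulant(i, j, n): return (n + j - i) % n

def patterned_matrix(p, n, link, rng):
    """X_p = ((x_{L(i,j)})): one independent N(0,1) input per distinct L-value."""
    inputs = {}
    x = np.empty((p, n))
    for i in range(1, p + 1):
        for j in range(1, n + 1):
            a = link(i, j, n)
            if a not in inputs:
                inputs[a] = rng.standard_normal()
            x[i - 1, j - 1] = inputs[a]
    return x

rng = np.random.default_rng(3)
X = patterned_matrix(200, 400, L_hankel, rng)
lam = np.linalg.eigvalsh(X @ X.T / 400)   # ESD of A_p(H), as in Theorem 2(i)
```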

For any set $G$, $\#G$ and $|G|$ will both stand for the number of elements of $G$. We shall use the following general assumptions on the link function $L$.

Assumptions on the link function $L$:

A. There exists a positive integer $\Phi$ such that for any $\alpha \in \mathcal{Z}$ and all $p$: (i) $\#\{i : L_p(i,j) = \alpha\} \le \Phi$ for all $j \in \{1, 2, \ldots, n\}$, and (ii) $\#\{j : L_p(i,j) = \alpha\} \le \Phi$ for all $i \in \{1, 2, \ldots, p\}$.

A$'$. For any $\alpha \in \mathcal{Z}$ and all $p$, $\#\{i : L_p(i,j) = \alpha\} \le 1$ for all $j \in \{1, 2, \ldots, n\}$.

B. $k_p \alpha_p = O(np)$, where $k_p = \#\{\alpha : L_p(i,j) = \alpha \text{ for some } 1 \le i \le p,\ 1 \le j \le n\}$ and $\alpha_p = \max_{\alpha \in \mathcal{Z}} \#\{(i,j) : L_p(i,j) = \alpha\}$.

Assumption A stipulates that, with increasing dimensions, the number of times any fixed input variable appears in any given row or column remains bounded. Assumption A$'$ stipulates that no input appears more than once in any column. Assumption B makes sure that no particular input appears too many times in the matrix. Clearly A$'$ implies A(i), but B is not related to either A or A$'$. Consider the link function $L_p(i,j) = (1,1)$ if $i = j$ and $L_p(i,j) = (i,j)$ if $i \ne j$. This link function satisfies A(ii) and A$'$ but not B. On the other hand, if $L_p(i,j) = 1$ for all $(i,j)$ then it satisfies B but not A. It is easy to verify that the link functions listed above satisfy these assumptions. It may also be noted that the symmetric Toeplitz link function $L(i,j) = |i-j|$ does not satisfy Assumption A$'$.

Towards applying the moment method, we need a few notions, most of which are given in detail in Bose and Sen (2008) [6] in the context of symmetric matrices. These concepts and definitions remain valid in Regime II as well, with appropriate changes.

The trace formula. Let $A_p = \frac{1}{n} X_p X_p'$. Then the $h$-th moment of the ESD of $A_p$ is given by

$$\frac{1}{p}\,\mathrm{Tr}\,A_p^h = \frac{1}{p n^h} \sum_{\substack{1 \le i_1, i_3, \ldots, i_{2h-1} \le p \\ 1 \le i_2, i_4, \ldots, i_{2h} \le n}} x_{L_p(i_1,i_2)}\,x_{L_p(i_3,i_2)}\,x_{L_p(i_3,i_4)} \cdots x_{L_p(i_{2h-1},i_{2h})}\,x_{L_p(i_1,i_{2h})}. \qquad (13)$$

Circuits. Any function $\pi : \{0, 1, 2, \ldots, 2h\} \to \mathbb{Z}_+$ is said to be a circuit if (i) $\pi(0) = \pi(2h)$, (ii) $1 \le \pi(2i) \le p$ for $0 \le i \le h$, and (iii) $1 \le \pi(2i-1) \le n$ for $1 \le i \le h$. The length $l(\pi)$ of $\pi$ is taken to be $2h$. A circuit depends on $h$ and $p$, but we will suppress this dependence.

Matched circuits. Let

$$\xi_\pi(2i-1) = L(\pi(2i-2), \pi(2i-1)) \quad \text{and} \quad \xi_\pi(2i) = L(\pi(2i), \pi(2i-1)), \quad 1 \le i \le h.$$

These will be called $L$-values. A circuit $\pi$ is said to have an edge of order $e$ ($1 \le e \le 2h$) if it has an $L$-value repeated exactly $e$ times. Any $\pi$ with all $e \ge 2$ will be called $L$-matched (in short, matched). For any such $\pi$, given any $i$, there is at least one $j \ne i$ such that $\xi_\pi(i) = \xi_\pi(j)$. Define, for any circuit $\pi$,

$$X_\pi = \prod_{i=1}^{h} x_{\xi_\pi(2i-1)}\,x_{\xi_\pi(2i)}. \qquad (14)$$

Due to the mean zero and independence assumptions, if $\pi$ has at least one edge of order one then $E(X_\pi) = 0$.

Equivalence relation on circuits. Two circuits $\pi_1$ and $\pi_2$ of the same length are said to be equivalent if their $L$-values agree at exactly the same pairs $(i,j)$; that is, iff $\big\{\xi_{\pi_1}(i) = \xi_{\pi_1}(j) \Leftrightarrow \xi_{\pi_2}(i) = \xi_{\pi_2}(j)\big\}$. This defines an equivalence relation between the circuits.

Words. Equivalence classes arising from the above equivalence relation may be identified with partitions of $\{1, 2, \ldots, 2h\}$: to any partition we associate a word $w$ of length $l(w) = 2h$, whose letters appear in alphabetical order of first occurrence. For example, if $h = 3$, then the partition $\{\{1,3,6\}, \{2,4\}, \{5\}\}$ is represented by the word $ababca$. Let $|w|$ denote the number of different letters that appear in $w$.

The notions of order-$e$ edges, matching and nonmatching for $\pi$ carry over to words in a natural manner. For instance, the word $ababa$ is matched. The word $abcadbaa$ is nonmatched, has edges of order 1, 2 and 4, and the corresponding partition is $\{\{1,4,7,8\}, \{2,6\}, \{3\}, \{5\}\}$. As pointed out, it will be enough to consider only matched words, since $E(X_\pi) = 0$ for nonmatched words. The following fact is obvious.

Word count.

$$\#\{w : w \text{ is of length } 2h \text{ and has only order 2 edges, with } |w| = h\} = \frac{(2h)!}{2^h\,h!}. \qquad (15)$$
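The count in (15) is just the number of pair-partitions of $\{1, \ldots, 2h\}$. For small $h$ it can be confirmed by brute-force enumeration; a throwaway check (ours):

```python
from math import factorial

def pair_partitions(s):
    """Yield all partitions of the tuple s into unordered blocks of size 2."""
    if not s:
        yield []
        return
    first, rest = s[0], s[1:]
    for k in range(len(rest)):
        for tail in pair_partitions(rest[:k] + rest[k + 1:]):
            yield [(first, rest[k])] + tail

for h in range(1, 6):
    count = sum(1 for _ in pair_partitions(tuple(range(2 * h))))
    assert count == factorial(2 * h) // (2 ** h * factorial(h))
print("(15) verified for h = 1, ..., 5")
```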

The class $\Pi(w)$. Let $w[i]$ denote the $i$-th entry of $w$. The equivalence class corresponding to $w$ will be denoted by

$$\Pi(w) = \{\pi : w[i] = w[j] \Leftrightarrow \xi_\pi(i) = \xi_\pi(j)\}.$$

The number of partition blocks corresponding to $w$ will be denoted by $|w|$. If $\pi \in \Pi(w)$, then clearly the number of distinct $L$-values equals $|w|$; notationally, $\#\{\xi_\pi(i) : 1 \le i \le 2h\} = |w|$.

Note that, with the above definition, we may rewrite the trace formula as

$$\frac{1}{p}\,\mathrm{Tr}\,A_p^h = \frac{1}{p n^h} \sum_{\pi:\ \pi \text{ circuit}} X_\pi = \frac{1}{p n^h} \sum_{w} \sum_{\pi \in \Pi(w)} X_\pi,$$

where the first sum is taken over all words of length $2h$. After taking expectation, only the matched words survive and

$$E\Big[\frac{1}{p}\,\mathrm{Tr}\,A_p^h\Big] = \frac{1}{p n^h} \sum_{w \text{ matched}}\ \sum_{\pi \in \Pi(w)} E X_\pi.$$

To deal with (M4), we need multiple circuits. The following notions will be useful.

Jointly matched and cross-matched circuits. $k$ circuits $\pi_1, \pi_2, \ldots, \pi_k$ are said to be jointly matched if each $L$-value occurs at least twice across all circuits. They are said to be cross-matched if each circuit has at least one $L$-value which occurs in at least one of the other circuits.

The class $\Pi^*(w)$. Define, for any (matched) word $w$,

$$\Pi^*(w) = \{\pi : w[i] = w[j] \Rightarrow \xi_\pi(i) = \xi_\pi(j)\}. \qquad (16)$$

Note that $\Pi(w) \subseteq \Pi^*(w)$. However, as we will see, $\Pi^*(w)$ is equivalent to $\Pi(w)$ for asymptotic considerations, but is easier to work with.

Vertex and generating vertex. Each $\pi(i)$ of a circuit $\pi$ will be called a vertex. Further, $\pi(2i)$, $0 \le i \le h$, will be called even vertices or $p$-vertices, and $\pi(2i-1)$, $1 \le i \le h$, will be called odd vertices or $n$-vertices. A vertex $\pi(i)$ is said to be generating if either $i = 0$ or $w[i]$ is the position of the first occurrence of a letter. For example, if $w = abbcab$ then $\pi(0)$, $\pi(1)$, $\pi(2)$, $\pi(4)$ are generating vertices. We will call an odd generating vertex $\pi(2i-1)$ Type I if $\pi(2i)$ is also an even generating vertex; otherwise we will call $\pi(2i-1)$ a Type II odd generating vertex.

Obviously, the number of generating vertices in $\pi$ is $|w| + 1$. By Assumption A on the link function given earlier, a circuit of a fixed length is completely determined, up to finitely many choices, by its generating vertices. Hence, in Regime I, under Assumption A we obtain the simple but crucial estimate $|\Pi^*(w)| = O(n^{|w|+1})$.

Let $[a]$ denote the integer part of $a$. In Regime II we will further assume A$'$. In that case, it shall be shown that

$$|\Pi^*(w)| = O\big(p^{[\frac{|w|+1}{2}]+1}\, n^{[\frac{|w|}{2}]}\big).$$

3 Regime I

In Regime I, $p \to \infty$ such that $p/n \to y \in (0, \infty)$, and we consider the spectral distribution of $A_p = A_p(X) = \frac{1}{n} X_p X_p'$. There are two main theorems in this section. Theorem 5 is proved under Assumption B. We show that the input sequence may be taken to be bounded without loss of generality. This allows us to deal with bounded sequences in all the examples later. This result is proved sketchily in Bose and Sen (2008) [6], with special reference to the sample covariance matrix; we provide a detailed proof for clarity and completeness. Theorem 6 is proved under Assumptions A and B. We show that the ESD of $\{A_p(X)\}$ is almost surely tight and any subsequential limit is sub-Gaussian. Invoking the trace formula, we show that the LSD exists iff the moments converge. Further, in the limit, only pair matched terms potentially contribute.

To prove Theorem 5, we will need the following notion and result. The bounded Lipschitz metric $d_{BL}$ is defined on the space of probability measures as

$$d_{BL}(\mu, \nu) = \sup\Big\{\Big|\int f \, d\mu - \int f \, d\nu\Big| : \|f\|_\infty + \|f\|_L \le 1\Big\} \qquad (17)$$

where $\|f\|_\infty = \sup_x |f(x)|$ and $\|f\|_L = \sup_{x \ne y} |f(x) - f(y)|/|x - y|$. Recall that convergence in $d_{BL}$ implies weak convergence of measures.

The following inequalities estimate the metric distance $d_{BL}$ in terms of traces. Proofs may be found in Bai and Silverstein (2006) [2] or Bai (1999) [1], and use Lidskii's theorem (see Bhatia, 1997, page 69) [4].

Lemma 1. (i) Suppose $A$, $B$ are $n \times n$ symmetric real matrices. Then

$$d_{BL}^2(F^A, F^B) \le \Big(\frac{1}{n}\sum_{i=1}^{n} |\lambda_i(A) - \lambda_i(B)|\Big)^2 \le \frac{1}{n}\sum_{i=1}^{n} \big(\lambda_i(A) - \lambda_i(B)\big)^2 \le \frac{1}{n}\,\mathrm{Tr}(A - B)^2. \qquad (18)$$

(ii) For any square matrix $S$, let $F^S$ denote its empirical spectral distribution. Suppose $X$ and $Y$ are $p \times n$ real matrices. Let $A = XX'$ and $B = YY'$. Then

$$d_{BL}^2(F^A, F^B) \le \Big(\frac{1}{p}\sum_{i=1}^{p} |\lambda_i(A) - \lambda_i(B)|\Big)^2 \le \frac{2}{p^2}\,\mathrm{Tr}(A + B)\,\mathrm{Tr}\big[(X - Y)(X - Y)'\big]. \qquad (19)$$
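Both bounds are deterministic matrix inequalities and are easy to sanity-check numerically. A quick check of the second inequality in (19) on random matrices (ours):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 50, 80
X, Y = rng.standard_normal((p, n)), rng.standard_normal((p, n))
A, B = X @ X.T, Y @ Y.T
lamA, lamB = np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)   # ascending order
middle = (np.abs(lamA - lamB).sum() / p) ** 2
right = 2 / p**2 * np.trace(A + B) * np.trace((X - Y) @ (X - Y).T)
assert middle <= right        # second inequality in (19)
print(middle, "<=", right)
```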

Theorem 5. Let $p \to \infty$, $p/n \to y \in (0, \infty)$, and let Assumption B hold, so that $\alpha_p k_p = O(np)$. Suppose that for every bounded, mean zero and variance one i.i.d. input sequence, $F^{\frac{1}{n}X_p X_p'}$ converges weakly to some fixed nonrandom distribution $G$ almost surely. Then the same limit continues to hold if the input sequence is i.i.d. with mean zero and variance one.

Proof of Theorem 5. Without loss of generality we shall assume that $\mathcal{Z} = \mathbb{Z}$, and we also write $X$ for $X_p$. For $t > 0$, denote $\mu(t) = E[x_0 I(|x_0| > t)] = -E[x_0 I(|x_0| \le t)]$ and let $\sigma^2(t) = \mathrm{Var}(x_0 I(|x_0| \le t)) = E[x_0^2 I(|x_0| \le t)] - \mu(t)^2$. Since $E(x_0) = 0$ and $E(x_0^2) = 1$, we have $\mu(t) \to 0$ and $\sigma(t), \sigma^2(t) \to 1$ as $t \to \infty$. Define the bounded random variables

$$x_i^* = \frac{x_i I(|x_i| \le t) + \mu(t)}{\sigma(t)} = \frac{x_i - \bar{x}_i}{\sigma(t)}, \quad \text{where } \bar{x}_i = x_i I(|x_i| > t) - \mu(t) = x_i - \sigma(t)\,x_i^*. \qquad (20)$$

It is easy to see that $E(\bar{x}_0^2) = 1 - \sigma^2(t) - 2\mu(t)^2 \to 0$ as $t$ tends to infinity. Further, $\{x_i^*\}$ are i.i.d. bounded, mean zero and variance one random variables. Let us replace the entries $x_{L_p(i,j)}$ of the matrix $X_p$ by their truncated versions $x^*_{L_p(i,j)}$ (respectively $\bar{x}_{L_p(i,j)}$) and denote the resulting matrix by $Y$ (respectively $\bar{X}_p$). By the triangle inequality and (19),

$$d_{BL}^2\big(F^{\frac{1}{n}X_pX_p'}, F^{\frac{1}{n}YY'}\big) \le 2\,d_{BL}^2\big(F^{\frac{1}{n}X_pX_p'}, F^{\frac{1}{n}\sigma(t)^2 YY'}\big) + 2\,d_{BL}^2\big(F^{\frac{1}{n}\sigma(t)^2 YY'}, F^{\frac{1}{n}YY'}\big)$$
$$\le \frac{4}{p^2 n^2}\,\mathrm{Tr}\big(X_pX_p' + \sigma(t)^2 YY'\big)\,\mathrm{Tr}\big[(X_p - \sigma(t)Y)(X_p - \sigma(t)Y)'\big] + \frac{4}{p^2 n^2}\,\mathrm{Tr}\big(YY' + \sigma(t)^2 YY'\big)\,\mathrm{Tr}\big[(\sigma(t)Y - Y)(\sigma(t)Y - Y)'\big].$$

To tackle the first term on the right side above, note that by Assumption B each distinct input appears at most $\alpha_p$ times among the at most $k_p$ distinct inputs, so

$$\mathrm{Tr}\big(X_pX_p' + \sigma(t)^2 YY'\big) = \sum_{i=1}^{p}\sum_{k=1}^{n} x_{L_p(i,k)}^2 + \sigma(t)^2\sum_{i=1}^{p}\sum_{k=1}^{n} x_{L_p(i,k)}^{*2} \le \alpha_p k_p\Big(\frac{\sum_{i=1}^{k_p} x_i^2}{k_p}\Big) + \alpha_p k_p\Big(\frac{\sum_{i=1}^{k_p} x_i^{*2}}{k_p}\Big).$$

Therefore, using $\alpha_p k_p = O(np)$ and the SLLN, $(np)^{-1}\,\mathrm{Tr}(X_pX_p' + \sigma(t)^2 YY')$ is bounded almost surely. Now,

$$\frac{1}{np}\,\mathrm{Tr}\big[(X_p - \sigma(t)Y)(X_p - \sigma(t)Y)'\big] = \frac{1}{np}\,\mathrm{Tr}\big(\bar{X}_p\bar{X}_p'\big) = \frac{1}{np}\sum_{i=1}^{p}\sum_{k=1}^{n} \bar{x}_{L_p(i,k)}^2 \le \frac{\alpha_p k_p}{np}\Big(\frac{\sum_{i=1}^{k_p} \bar{x}_i^2}{k_p}\Big),$$

which is bounded by $C\,E(\bar{x}_1^2)$ almost surely, for some constant $C$. Here we use the condition $\alpha_p k_p = O(np)$ and the SLLN for $\{\bar{x}_i^2\}$. Since $E(\bar{x}_1^2) \to 0$ as $t \to \infty$, we can make the right side tend to $0$ almost surely by first letting $p$ tend to infinity, and then letting $t$ tend to infinity. This takes care of the first term.

To tackle the second term,

$$T_2 := \frac{2}{n^2 p^2}\,\mathrm{Tr}\big[(\sigma(t)^2 + 1)YY'\big]\,\mathrm{Tr}\big[(\sigma(t)Y - Y)(\sigma(t)Y - Y)'\big] = \frac{2}{n^2 p^2}\,(\sigma(t)^2 + 1)(\sigma(t) - 1)^2\,\big(\mathrm{Tr}(YY')\big)^2,$$

and, as above, $\mathrm{Tr}(YY') = \sum_{i=1}^{p}\sum_{k=1}^{n} x_{L_p(i,k)}^{*2} \le \alpha_p k_p \big(\sum_{i=1}^{k_p} x_i^{*2}/k_p\big)$. Again using the condition $\alpha_p k_p = O(np)$, the SLLN, and the fact that $\sigma(t) \to 1$ as $t \to \infty$, we get $T_2 \to 0$ almost surely. This completes the proof of the theorem. $\square$

To investigate the existence of the LSD of $A_p = A_p(X) = \frac{1}{n} X_p X_p'$, in view of Theorem 5 we shall assume that the input sequence is i.i.d. bounded, with mean zero and variance $1$. Recall the two formulae given earlier:

$$\frac{1}{p}\,\mathrm{Tr}\,A_p^h = \frac{1}{pn^h}\sum_{\pi:\ \pi \text{ circuit}} X_\pi = \frac{1}{pn^h}\sum_{w}\sum_{\pi \in \Pi(w)} X_\pi,$$

where the first sum is taken over all words of length $2h$, and

$$E\Big[\frac{1}{p}\,\mathrm{Tr}\,A_p^h\Big] = \frac{1}{pn^h}\sum_{w \text{ matched}}\ \sum_{\pi \in \Pi(w)} E X_\pi.$$

Theorem 6. Let $A_p = \frac{1}{n} X_p X_p'$ where the entries of $X_p$ are bounded, independent, with mean zero and variance $1$, and $L_p$ satisfies Assumptions A and B. Let $p/n \to y$, $0 < y < \infty$. Then:

(i) If $w$ is a matched word of length $2h$ with an edge of order at least $3$, then $\frac{1}{pn^h}\sum_{\pi \in \Pi(w)} E X_\pi \to 0$.

(ii) For each $h$,

$$E\Big[\frac{1}{p}\,\mathrm{Tr}\,A_p^h\Big] - \frac{1}{pn^h}\sum_{\substack{w \text{ matched} \\ |w| = h}} |\Pi^*(w)| \to 0.$$

(iii) For each $h$, $\frac{1}{p}\,\mathrm{Tr}\,A_p^h - E\big[\frac{1}{p}\,\mathrm{Tr}\,A_p^h\big] \to 0$ almost surely.

(iv) If we denote $\beta_h = \limsup_p \frac{1}{pn^h}\sum_{\substack{w \text{ matched},\ |w| = h}} |\Pi^*(w)|$, then $\{\beta_h\}_h$ satisfies Carleman's condition.

Moreover, the sequence of ESDs $\{F^{A_p(X)}\}$ is almost surely tight. Any subsequential limit is sub-Gaussian. The LSD exists iff $\lim_p \frac{1}{pn^h}\sum_{w \text{ matched},\ |w| = h} |\Pi^*(w)|$ exists for each $h$. The same LSD continues to hold if the input sequence is i.i.d. (not necessarily bounded) with mean zero and variance one.

Remark 1. Without assuming anything further on the link function $L_p$, the above limits need not exist. However, in Section 1.5 we have seen several examples where the limits do indeed exist.

Proof of Theorem 6. For the special case of the covariance matrix, a proof can be found in Bose and Sen (2008) [6]. The same arguments may be adapted to general link functions. For the sake of completeness, here are the essential steps.

Recall that circuits which have at least one edge of order one contribute zero. Thus, consider all circuits which have at least one edge of order at least $3$ and all other edges of order at least $2$. Let $N_{h,3+}$ be the number of such circuits of length $2h$. Suppose first that $p = n$, so that $y = 1$. In this case $\pi$ has uniform range, $1 \le \pi(i) \le n$ for all $i \le 2h$. Then, from the arguments of Bryc, Dembo and Jiang (2006) [7], $n^{-(1+h)} N_{h,3+} \to 0$.

Now for general $y > 0$, the range of $\pi(i)$ is not the same for every $i$: for odd vertices it is from $1$ to $n$, and for even vertices it is from $1$ to $p$. However, this case is easily reduced to the previous case ($p = n$) as follows. Let $\widetilde{\Pi}(w)$ be the possibly larger class of the same circuits but with range $1 \le \pi(i) \le \max(p,n)$ for all $i \le 2h$. Then there is a constant $C$ such that

$$\frac{1}{pn^h}\sum_{w \text{ has an edge of order at least } 3} |\Pi(w)| \le C\,[\max(p,n)]^{-(h+1)} \sum_{w \text{ has an edge of order at least } 3} |\widetilde{\Pi}(w)| \to 0$$

from the previous case. This proves (i). Statement (ii) is then a consequence.

For (iii), it is enough to show that

$$E\Big[\frac{1}{p}\,\mathrm{Tr}\,A_p^h - E\frac{1}{p}\,\mathrm{Tr}\,A_p^h\Big]^4 = O(p^{-2}).$$

The proof is essentially the same as the proof of Lemma 2(b) in Bose and Sen (2008) [6]. For the sake of completeness, we give the full proof here. We write the fourth moment as

$$\frac{1}{p^4}\,E\big[\mathrm{Tr}\,A_p^h - E(\mathrm{Tr}\,A_p^h)\big]^4 = \frac{1}{p^4 n^{4h}} \sum_{\pi_1, \pi_2, \pi_3, \pi_4} E\Big[\prod_{i=1}^{4}\big(X_{\pi_i} - E X_{\pi_i}\big)\Big].$$

If $(\pi_1, \pi_2, \pi_3, \pi_4)$ are not jointly matched, then one of the circuits, say $\pi_j$, has an $L$-value which does not occur anywhere else. Also note that $E X_{\pi_j} = 0$. Hence, using independence,

$$E\Big[\prod_{i=1}^{4}\big(X_{\pi_i} - E X_{\pi_i}\big)\Big] = E\Big[X_{\pi_j}\prod_{i=1,\,i \ne j}^{4}\big(X_{\pi_i} - E X_{\pi_i}\big)\Big] = 0.$$

Further, if $(\pi_1, \pi_2, \pi_3, \pi_4)$ is jointly matched but not cross-matched, then one of the circuits, say $\pi_j$, is only self-matched; that is, none of its $L$-values is shared with those of the other circuits. Then again, by independence,

$$E\Big[\prod_{i=1}^{4}\big(X_{\pi_i} - E X_{\pi_i}\big)\Big] = E\big[X_{\pi_j} - E X_{\pi_j}\big]\; E\Big[\prod_{i=1,\,i \ne j}^{4}\big(X_{\pi_i} - E X_{\pi_i}\big)\Big] = 0.$$

So it is clear that for a non-zero contribution, the quadruple of circuits must be jointly matched and cross-matched. Observe that the total number of edges in each circuit of the quadruple is $2h$, so the total number of edges over the 4 circuits is $8h$.

Since they are at least pair matched, there can be at most $4h$ distinct edges, i.e., distinct $L$-values. In Bryc, Dembo and Jiang (2006) [7], the authors estimated the number of quadruples of circuits which are jointly matched and cross-matched. They used a method of counting which they applied only to the Toeplitz and Hankel link functions, but that method works equally well for any general link function satisfying Assumption A. Using this method, we can say that, apart from $\pi_1(0)$, $\pi_2(0)$, $\pi_3(0)$ and $\pi_4(0)$, which are always generating vertices, there can be at most $4h - 2$ generating vertices determining the $4h$ distinct $L$-values. So the total number of generating vertices obtained using the method of Bryc, Dembo and Jiang (2006) [7] is at most $4h + 2$. Since $p/n \to y$, $0 < y < \infty$, it is easy to see that there is a constant $K$, depending on $y$ and $h$, such that

$$\frac{1}{p^4}\,E\big[\mathrm{Tr}\,A_p^h - E(\mathrm{Tr}\,A_p^h)\big]^4 \le K\,\frac{p^{4h+2}}{p^{4h+4}} = O(p^{-2}). \qquad (21)$$

This guarantees that if there is convergence, it is almost sure. So (iii) is proved.

By Assumption A, for any matched word $w$ of length $2h$ with $|w| = h$, we have $|\Pi^*(w)| \le n^{h+1}\Phi^h$, which combined with the fact (15) yields the validity of Carleman's condition, proving (iv). The other claims of the theorem are easy consequences of the discussion so far. $\square$

4 Regime II

In this case, $p \to \infty$ and $p/n \to 0$. From the experience of the existing results for the sample covariance matrix (see, for example, Bai and Silverstein (2006) [2]), the following scaled and centered version is required. Define

$$N_p = N_p(X) = \Big(\frac{n}{p}\Big)^{1/2}\Big(\frac{1}{n} X_p X_p' - I_p\Big) = \sqrt{\frac{n}{p}}\,\big(A_p(X) - I_p\big). \qquad (22)$$

A straightforward extension of Theorem 5 to this situation is not possible. In Theorem 7 we show how the matrix $N_p$ defined above may be approximated, under an additional moment assumption, by the following matrix $B_p = B_p(X)$ with bounded entries:

$$(B_p)_{ij} = \frac{1}{\sqrt{np}}\sum_{l=1}^{n} \hat{x}_{L(i,l)}\,\hat{x}_{L(j,l)} \ \text{ if } i \ne j, \quad \text{and} \quad (B_p)_{ii} = 0, \qquad (23)$$

where $\bar{x}_\alpha = x_\alpha I(|x_\alpha| \le \epsilon_p n^{1/4})$, $\hat{x}_\alpha = \bar{x}_\alpha - E\bar{x}_\alpha$, and $\epsilon_p$ will be chosen later.
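The construction of $B_p$ from the raw input is mechanical. The sketch below (ours) performs the truncation at level $\epsilon_p n^{1/4}$; the exact centering $E\bar{x}_\alpha$ depends on the input law, and it is replaced here by the empirical mean purely for illustration.

```python
import numpy as np

def truncate_and_center(x, eps_p, n):
    """x_bar = x * 1(|x| <= eps_p n^{1/4}); x_hat = x_bar - E x_bar.
    E x_bar is approximated by the empirical mean (illustration only)."""
    xbar = np.where(np.abs(x) <= eps_p * n ** 0.25, x, 0.0)
    return xbar - xbar.mean()

def B_matrix(xhat):
    """(B_p)_{ij} = (np)^{-1/2} sum_l xhat_{L(i,l)} xhat_{L(j,l)}, zero diagonal; see (23)."""
    p, n = xhat.shape
    b = (xhat @ xhat.T) / np.sqrt(n * p)
    np.fill_diagonal(b, 0.0)
    return b
```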

Theorem 8 establishes the relationship between the existence of the LSD and the limits of the empirical moments. In Theorem 9 we prove an interesting result in Regime II: suppose $L^{(1)}$ and $L^{(2)}$ are two link functions satisfying Assumptions A(ii), A$'$ and B which agree on the set $\{(i,j) : 1 \le i \le p,\ 1 \le j \le n,\ i < j\}$; then the corresponding matrices $N_p(X)$ have identical LSDs. In Theorem 10 we work with Assumption A and show the closeness of the LSDs in probability when two link functions agree on the above set.

Theorem 7. Let $p, n \to \infty$ so that $p/n \to 0$, and suppose Assumption R2 holds. Suppose $L_p$ satisfies Assumptions A and B. Then there exists a nonrandom sequence $\{\epsilon_p\}$ with the property that $\epsilon_p \to 0$ but $\epsilon_p p^{1/4} \to \infty$ as $p \to \infty$, such that $D(F^{B_p}, F^{N_p}) \to 0$ a.s., where $D$ is the metric for convergence in distribution on the space of all probability distribution functions.

Proof of Theorem 7. Let $\bar{X}_p = ((\bar{x}_{L(i,j)}))_{1 \le i \le p,\ 1 \le j \le n}$ and $\tilde{N}_p = \frac{1}{\sqrt{np}}\big(\bar{X}_p\bar{X}_p' - nI_p\big)$. Then

$$\sup_x \big|F^{N_p}(x) - F^{\tilde{N}_p}(x)\big| \le \frac{1}{p}\,\mathrm{rank}\big(X_p - \bar{X}_p\big) \le \frac{1}{p}\sum_{i=1}^{p}\sum_{j=1}^{n} \eta_{L(i,j)}, \quad \text{where } \eta_\alpha = I\big(|x_\alpha| > \epsilon_p n^{1/4}\big).$$

The first inequality above follows from Bai (1999) [1, Lemma 2.6]. Write $q_p = \sup_\alpha P(|x_\alpha| > \epsilon_p n^{1/4})$. We claim that there exists a sequence $\epsilon_p \to 0$, going to zero arbitrarily slowly, such that

$$q_p \le \epsilon_p\, n^{-(1+1/\lambda)-\delta/8}. \qquad (24)$$

To establish the claim, for simplicity assume that $n = n(p)$ is an increasing function of $p$. Fix any $\epsilon > 0$. We have

$$n^{(1+1/\lambda)+\delta/8}\,\sup_\alpha P\big(|x_\alpha| > \epsilon n^{1/4}\big) \le \epsilon^{-4(1+1/\lambda)-\delta/2}\,\sup_\alpha E\big[|x_\alpha|^{4(1+1/\lambda)+\delta/2}\, I(|x_\alpha| > \epsilon n^{1/4})\big] \to 0 \qquad (25)$$

since the random variables $\{|x_\alpha|^{4(1+1/\lambda)+\delta/2}\}$ are uniformly integrable. Given $m$, by (25) find an integer $p_m$ such that $n \ge n_m := n(p_m)$ implies

$$n^{(1+1/\lambda)+\delta/8}\,\sup_\alpha P\big(|x_\alpha| > m^{-1} n^{1/4}\big) \le m^{-1}.$$

Define $\epsilon_p = 1/m$ if $p_m \le p < p_{m+1}$, and $\epsilon_p = n(p_1)^{(1+1/\lambda)+\delta/8}$ for $p < p_1$. Note that by choosing the integers in the sequence $p_1 < p_2 < \cdots$ as large as we want, we can make $\epsilon_p$ go to zero as slowly as we like. Clearly, $\epsilon_p$ satisfies inequality (24).

For any $\beta > 0$, and with $Y_i$ independent Bernoulli variables with $E(Y_i) \le q_p$,

$$P\Big(\sup_x |F^{N_p}(x) - F^{\tilde{N}_p}(x)| > \beta\Big) \le P\Big(\frac{1}{p}\sum_{i=1}^{p}\sum_{j=1}^{n}\eta_{L(i,j)} > \beta\Big) \le P\Big(\alpha_p \sum_{i=1}^{k_p} Y_i > \beta p\Big) \le \frac{\alpha_p}{\beta p}\sum_{i=1}^{k_p} E Y_i \quad \text{(by Markov's inequality)}$$
$$\le \frac{\alpha_p k_p q_p}{\beta p} \le C\, n q_p = o\big(n^{-(1/\lambda + \delta/8)}\big) = o\big(p^{-(1+\lambda\delta/8)}\big),$$

where the constant $C$ is such that $\alpha_p k_p \le C\beta np$. Hence, by the Borel-Cantelli lemma,

$$\sup_x \big|F^{N_p}(x) - F^{\tilde{N}_p}(x)\big| \to 0 \quad \text{a.s.}$$

Let $\hat{X}_p = ((\hat{x}_{L(i,j)}))_{1 \le i \le p,\ 1 \le j \le n}$ and $\hat{N}_p = \frac{1}{\sqrt{np}}\big(\hat{X}_p\hat{X}_p' - nI_p\big)$.

Using Lemma 1(i) and (ii),

$$d_{BL}^2\big(F^{\hat{N}_p}, F^{\tilde{N}_p}\big) \le \Big(\frac{1}{p}\sum_{i=1}^{p}\big|\lambda_i(\hat{N}_p) - \lambda_i(\tilde{N}_p)\big|\Big)^2 \le \frac{2}{np^3}\,\mathrm{Tr}\big(\bar{X}_p\bar{X}_p' + \hat{X}_p\hat{X}_p'\big)\,\mathrm{Tr}\big[E(\bar{X}_p)\,E(\bar{X}_p)'\big].$$

Using the moment condition, the condition on the truncation level, the condition $\alpha_p k_p = O(np)$ and an appropriate strong law of large numbers for independent random variables, it is easy to show that the above expression tends to $0$ almost surely. We omit the tedious details.

On the other hand,

$$d_{BL}^2\big(F^{\hat{N}_p}, F^{B_p}\big) \le \frac{1}{p}\,\mathrm{Tr}\big(\hat{N}_p - B_p\big)^2 \le \frac{2}{np^2}\sum_{i=1}^{p}\Big(\sum_{l=1}^{n}\big(\hat{x}_{L(i,l)}^2 - E\hat{x}_{L(i,l)}^2\big)\Big)^2 + \frac{2}{np^2}\sum_{i=1}^{p}\Big(\sum_{l=1}^{n}\big(1 - E\hat{x}_{L(i,l)}^2\big)\Big)^2 = M + N, \ \text{say}.$$

Note that for every $i$, $j$ there exists $\alpha$ such that

$$0 \le 1 - E\hat{x}_{L(i,j)}^2 = E\big[x_\alpha^2 I(|x_\alpha| > \epsilon_p n^{1/4})\big] + \big(E\big[x_\alpha I(|x_\alpha| > \epsilon_p n^{1/4})\big]\big)^2 \le \frac{1}{\epsilon_p^2 n^{1/2}}\big(E x_\alpha^4 + E^2 x_\alpha^2\big). \qquad (26)$$

From (26) it is immediate that

$$N \le \frac{2}{p\,\epsilon_p^4}\Big(\sup_\alpha \big(E x_\alpha^4 + E^2 x_\alpha^2\big)\Big)^2 \to 0 \quad \text{since } \epsilon_p\, p^{1/4} \to \infty.$$

Now we deal with the first term, $M$. Write

$$\sum_{i=1}^{p}\Big(\sum_{l=1}^{n}\big(\hat{x}_{L(i,l)}^2 - E\hat{x}_{L(i,l)}^2\big)\Big)^2 = \sum_{\alpha} a_\alpha\big(\hat{x}_\alpha^2 - E\hat{x}_\alpha^2\big)^2 + \sum_{\alpha \ne \alpha'} b_{\alpha,\alpha'}\big(\hat{x}_\alpha^2 - E\hat{x}_\alpha^2\big)\big(\hat{x}_{\alpha'}^2 - E\hat{x}_{\alpha'}^2\big) = T_{1p} + T_{2p} \ \text{(say)},$$

where $a_\alpha, b_{\alpha,\alpha'} \ge 0$. Obviously,

$$\#\{\alpha \in \mathcal{Z} : a_\alpha \ne 0\} \le k_p \quad \text{and} \quad \#\{(\alpha,\alpha') \in \mathcal{Z}^2 : \alpha \ne \alpha',\ b_{\alpha,\alpha'} \ne 0\} \le k_p^2.$$

Also, $a_\alpha \le \alpha_p$ for all $\alpha$, and $b_{\alpha,\alpha'} \le \Phi^2 \alpha_p$ for all $\alpha \ne \alpha'$. Hence,

$$\sum_p \frac{1}{4n^2p^4}\,E T_{1p}^2 \le \sum_p \frac{\alpha_p^2 k_p}{4n^2p^4}\,\sup_\alpha E\big(\hat{x}_\alpha^2 - E\hat{x}_\alpha^2\big)^4 + \sum_p \frac{\alpha_p^2 k_p^2}{4n^2p^4}\,\sup_\alpha E^2\big(\hat{x}_\alpha^2 - E\hat{x}_\alpha^2\big)^2$$
$$\le \sum_p \frac{\alpha_p^2 k_p\,(n^{1/4}\epsilon_p)^4}{4n^2p^4}\,\sup_\alpha E x_\alpha^4 + \sum_p \frac{\alpha_p^2 k_p^2}{4n^2p^4}\,\sup_\alpha E^2 x_\alpha^4 < \infty,$$

where we use the facts that $\alpha_p k_p = O(np)$ and $\alpha_p \le \Phi p$. Thus $T_{1p}/(2np^2) \to 0$ a.s.

It now remains to tackle $T_{2p}$. Let $y_\alpha = \hat{x}_\alpha^2 - E\hat{x}_\alpha^2$, $\alpha \in \mathcal{Z}$. Then $\{y_\alpha\}$ are mean zero independent random variables, and

$$\sum_p \frac{1}{4n^2p^4}\,E T_{2p}^2 = \sum_p \frac{1}{4n^2p^4}\,E\Big(\sum_{\alpha \ne \alpha'} b_{\alpha,\alpha'}\, y_\alpha y_{\alpha'}\Big)^2 \le \sum_p \frac{1}{4n^2p^4}\sum_{\alpha \ne \alpha'} b_{\alpha,\alpha'}^2\, E y_\alpha^2 y_{\alpha'}^2 \le \sum_p \frac{\Phi^4 \alpha_p^2 k_p^2}{4n^2p^4}\,\sup_\alpha E^2 \hat{x}_\alpha^4 < \infty.$$

Thus, by the Borel-Cantelli lemma, $T_{2p}/(np^2) \to 0$ almost surely, and this completes the proof. $\square$

Remark 2. (a) From the results of Bai and Yin (1988) [3], it is known that for the special case of the sample covariance matrix, a finite fourth moment is needed for the above approximation to work and, inter alia, for the existence of the LSD almost surely. Here the link function is more general, necessitating slightly higher moments.

(b) If we carefully follow the above proof, the finiteness of the $\big(4(1+1/\lambda)+\delta\big)$-th moment was needed only in proving that $\sup_x |F^{N_p}(x) - F^{\tilde{N}_p}(x)| \to 0$ a.s. If we impose the weaker assumption $\sup_\alpha E x_\alpha^4 < \infty$, then $q_p \le \epsilon_p/n$ for a suitably chosen sequence $\{\epsilon_p\} \to 0$, and $\sup_x |F^{N_p}(x) - F^{\tilde{N}_p}(x)| \to 0$ in probability holds, and hence $D(F^{B_p}, F^{N_p}) \to 0$ in probability.

Having approximated $N_p$ by $B_p$, we now need to establish the behaviour of the moments of $B_p$. This is done through a series of lemmas, finally leading to Theorem 8. In the subsequent discussion we will use the following notation:

$$\Pi'(w) := \{\pi \in \Pi(w) : \pi(2i-2) \ne \pi(2i)\ \forall\, 1 \le i \le h\}, \qquad \Pi^{*\prime}(w) := \{\pi \in \Pi^*(w) : \pi(2i-2) \ne \pi(2i)\ \forall\, 1 \le i \le h\}.$$

Analogously to $\Pi(w)$ and $\Pi^*(w)$, we define, for several words $w_1, w_2, \ldots, w_k$,

$$\Pi(w_1, w_2, \ldots, w_k) := \{(\pi_1, \pi_2, \ldots, \pi_k) : w_i[s] = w_j[l] \Leftrightarrow \xi_{\pi_i}(s) = \xi_{\pi_j}(l),\ \forall\, 1 \le i, j \le k\},$$
$$\Pi^*(w_1, w_2, \ldots, w_k) := \{(\pi_1, \pi_2, \ldots, \pi_k) : w_i[s] = w_j[l] \Rightarrow \xi_{\pi_i}(s) = \xi_{\pi_j}(l),\ \forall\, 1 \le i, j \le k\}.$$

(The definition of $\xi_\pi$ has been given earlier.) Moreover,

$$\Pi'(w_1, w_2, \ldots, w_k) := \big\{(\pi_1, \pi_2, \ldots, \pi_k) \in \Pi(w_1, w_2, \ldots, w_k) : \pi_i(0) \ne \pi_i(2), \ldots, \pi_i(2h-2) \ne \pi_i(2h),\ \forall\, 1 \le i \le k\big\}.$$

For Lemmas 2-5, we will always assume that the link function $L$ satisfies Assumptions A(ii) and A$'$.

Lemma 2. Fix $h$ and a matched word $w$ of length $2h$. Then

$$\big|\Pi^{*\prime}(w)\big| \le K_h\, p^{[\frac{|w|+1}{2}]+1}\, n^{[\frac{|w|}{2}]}, \qquad (27)$$

where $K_h$ is some constant depending on $h$.

Proof of Lemma 2. Let $k$ be the number of odd generating vertices, denoted $\pi(2i_1-1), \pi(2i_2-1), \ldots, \pi(2i_k-1)$, where $i_1 < i_2 < \cdots < i_k < h$. We wish to emphasize that $i_k < h$, since $\{\xi_\pi(2h-1), \xi_\pi(2h)\}$ cannot be a matched edge, as $\xi_\pi(2h-1) \ne \xi_\pi(2h)$ by A$'$. Recall the definitions of Type I and Type II odd generating vertices given in Section 2. Let $t$ be the number of Type I odd generating vertices. Since the total number of generating vertices apart from $\pi(0)$ is $|w|$, we have $t \le [|w|/2]$.

Now fix a Type II odd generating vertex $\pi(2i-1)$, and suppose we have already made our choices for the vertices $\pi(j)$, $j < 2i-1$, which come before $\pi(2i-1)$. Since $\pi(2i)$ is not a generating vertex,

$$\xi_\pi(2i) = \xi_\pi(j) \ \text{ for some } j < 2i. \qquad (28)$$

(Note that since $\pi(2i-2) \ne \pi(2i)$, we cannot have $\xi_\pi(2i) = \xi_\pi(2i-1)$, by Assumption A$'$.) Now that the value of $\xi_\pi(j)$ has been fixed, for each value of $\pi(2i)$ there can be at most $\Phi$ choices of $\pi(2i-1)$ such that (28) is satisfied. Thus we can have at most $p\Phi$ choices for the generating vertex $\pi(2i-1)$. In short, there are only $O(p)$ possibilities for a Type II odd generating vertex to choose from.

With the above crucial observation before us, it is now easy to conclude that

$$\big|\Pi^{*\prime}(w)\big| = O\big(p^{\#\{\text{even generating vertices}\} + \#\{\text{Type II odd vertices}\}}\; n^{\#\{\text{Type I odd vertices}\}}\big) = O\big(p^{[\frac{|w|+1}{2}]+1}\, n^{[\frac{|w|}{2}]}\big). \ \square$$

Lemma 3. (i) For every even $h$,

$$\frac{1}{p}\,E\,\mathrm{Tr}\,B_p^h - \sum_{\substack{w \text{ matched} \\ |w| = h}} \frac{1}{p^{1+h/2}\,n^{h/2}}\,\big|\Pi^{*\prime}(w)\big| \to 0.$$

(ii) For every odd $h$, $\lim_p \frac{1}{p}\,E\,\mathrm{Tr}\,B_p^h = 0$.

Proof of Lemma 3. Let $\hat{X}_\pi$ be as defined in (14), with $x_\alpha$ replaced by $\hat{x}_\alpha$. From the fact that $E\hat{x}_\alpha = 0$ and $(B_p)_{ii} = 0$ for all $i$, we have

$$\frac{1}{p}\,E\,\mathrm{Tr}\,B_p^h = \frac{1}{p^{1+h/2}\,n^{h/2}} \sum_{w \text{ matched}}\ \sum_{\pi \in \Pi'(w)} E\hat{X}_\pi.$$

Fix a matched word $w$ of length $2h$. It induces a partition on the $2h$ $L$-values $\xi_\pi(1), \xi_\pi(2), \ldots, \xi_\pi(2h)$, resulting in $|w|$ groups (partition blocks), where the $\xi_\pi$-values within a group are equal and across groups are distinct. Let $C_k$ be the number of groups of size $k$. Clearly,

$$C_2 + C_3 + \cdots + C_{2h} = |w| \quad \text{and} \quad 2C_2 + 3C_3 + \cdots + 2h\,C_{2h} = 2h.$$

Note that

$$\sup_\alpha \big|1 - E\hat{x}_\alpha^2\big| = o(1) \quad \text{and} \quad \sup_\alpha E|\hat{x}_\alpha|^k \le (\epsilon_p n^{1/4})^{k-2} \ \text{ for } k \ge 2.$$

Thus, if $\pi \in \Pi(w)$,

$$E\big|\hat{x}_{\xi_\pi(1)}\hat{x}_{\xi_\pi(2)} \cdots \hat{x}_{\xi_\pi(2h)}\big| \le (\epsilon_p n^{1/4})^{0 \cdot C_2 + 1 \cdot C_3 + \cdots + (2h-2)\,C_{2h}} = (\epsilon_p n^{1/4})^{2h - 2|w|}.$$

Using Lemma 2,

$$\sum_{\pi \in \Pi'(w)} E\big|\hat{x}_{\xi_\pi(1)}\hat{x}_{\xi_\pi(2)} \cdots \hat{x}_{\xi_\pi(2h)}\big| \le (\epsilon_p n^{1/4})^{2h-2|w|}\, K_h\, p^{[\frac{|w|+1}{2}]+1}\, n^{[\frac{|w|}{2}]}.$$

Case I. Either $|w| < h$, or $|w| = h$ with $h$ odd. Then

$$\frac{1}{p^{1+h/2}\,n^{h/2}} \sum_{\pi \in \Pi'(w)} E\big|\hat{x}_{\xi_\pi(1)} \cdots \hat{x}_{\xi_\pi(2h)}\big| \le K_h\;\epsilon_p^{2h-2|w|}\; n^{[\frac{|w|}{2}]-\frac{|w|}{2}}\; p^{[\frac{|w|+1}{2}]-\frac{h}{2}} \to 0. \qquad (29)$$

(Indeed, if $|w| = h$ with $h$ odd the bound is of order $(p/n)^{1/2}$, while otherwise either the power of $p$ is nonpositive and $\epsilon_p^{2h-2|w|} \to 0$, or the power of $n$ is $-1/2$.)

Case II. If $|w| = h$ with $h$ even, then each $L$-value appears exactly twice, so

$$\lim_p \frac{1}{p^{1+h/2}\,n^{h/2}} \sum_{\pi \in \Pi'(w)} E\big[\hat{x}_{\xi_\pi(1)}\hat{x}_{\xi_\pi(2)} \cdots \hat{x}_{\xi_\pi(2h)}\big] = \lim_p \frac{1}{p^{1+h/2}\,n^{h/2}}\,\big|\Pi'(w)\big|\,\big(1+o(1)\big)^h = \lim_p \frac{1}{p^{1+h/2}\,n^{h/2}}\,\big|\Pi'(w)\big| = \lim_p \frac{1}{p^{1+h/2}\,n^{h/2}}\,\big|\Pi^{*\prime}(w)\big|.$$

This completes the proof. $\square$

Fix jointly matched and cross-matched words $(w_1, w_2, w_3, w_4)$, each of length $2h$. Let

$$\kappa = \text{total number of distinct letters in } w_1, w_2, w_3 \text{ and } w_4. \qquad (30)$$

Lemma 4. For some constant $C_h$ depending on $h$,

$$\#\big\{(\pi_1,\pi_2,\pi_3,\pi_4) \in \Pi(w_1,w_2,w_3,w_4) : \pi_i(0) \ne \pi_i(2), \ldots, \pi_i(2h-2) \ne \pi_i(2h),\ i = 1,2,3,4\big\} \le \begin{cases} C_h\, p^{2+2h}\, n^{2h} & \text{if } \kappa = 4h \text{ or } 4h-1, \\ C_h\, p^{4+[\frac{\kappa+1}{2}]}\, n^{[\frac{\kappa}{2}]} & \text{if } \kappa \le 4h-2. \end{cases} \qquad (31)$$

Proof of Lemma 4. Case I: $\kappa = 4h$ or $4h-1$. Since the circuits are cross-matched, upon reordering the circuits, making a circular shift, and counting anti-clockwise if necessary, we may assume without loss of generality that $\xi_{\pi_1}(2h)$ does not match with any $L$-value in $\pi_1$; and when $\kappa = 4h$, we may further assume that $\xi_{\pi_2}(2h)$ does not match with any $L$-value in $\pi_1$ or $\pi_2$.

We first fix the values of the $p$-vertices $\pi_i(0)$, $1 \le i \le 4$, all of which are even generating vertices. Then we scan all the vertices from left to right, one circuit after another. We will then, as argued in Bryc, Dembo and Jiang (2006) [7], obtain a total of $4h+2$ generating vertices instead of the $4h+4$ generating vertices which we would have obtained by the usual counting. In our dynamic counting, we scan the $L$-values in the following order:

$$\xi_{\pi_1}(1), \xi_{\pi_1}(2), \xi_{\pi_1}(3), \ldots, \xi_{\pi_1}(2h), \xi_{\pi_2}(1), \xi_{\pi_2}(2), \ldots, \xi_{\pi_2}(2h), \xi_{\pi_3}(1), \ldots, \xi_{\pi_4}(2h).$$

From the arguments given in Lemma 2, it is clear that for an odd vertex $\pi_i(2j-1)$ to be Type I, both $\xi_{\pi_i}(2j-1)$ and $\xi_{\pi_i}(2j)$ have to be the first appearances of two distinct $L$-values. So the total number of Type I $n$-generating vertices is at most $\kappa/2$.

Case II: $\kappa \le 4h-2$. We again apply the crucial fact that for an odd vertex $\pi_i(2j-1)$ to be Type I, we need both $\xi_{\pi_i}(2j-1)$ and $\xi_{\pi_i}(2j)$ to be the first appearances of two distinct $L$-values, while the circuits are being scanned from left to right, one after another. Since there are exactly $\kappa$ distinct $L$-values, the number of Type I odd vertices is not more than $\kappa/2$. Combining this with the fact that the total number of generating vertices equals $\kappa + 4$, we get the required bound. $\square$

Lemma 5. For each fixed $h$,

$$\sum_{p=1}^{\infty} E\Big[\frac{1}{p}\big(\mathrm{Tr}\,B_p^h - E(\mathrm{Tr}\,B_p^h)\big)\Big]^4 < \infty.$$

Proof of Lemma 5. As argued earlier, and from the fact that the diagonal elements of $B_p$ are all zero, we have

$$E\Big[\frac{1}{p}\big(\mathrm{Tr}\,B_p^h - E(\mathrm{Tr}\,B_p^h)\big)\Big]^4 = \frac{1}{p^{4+2h}\,n^{2h}} \sum\ \sum\ E\Big[\prod_{i=1}^{4}\big(\hat{X}_{\pi_i} - E\hat{X}_{\pi_i}\big)\Big],$$

where the outer sum is over all quadruples of words $(w_1, w_2, w_3, w_4)$, each of length $2h$, which are jointly matched and cross-matched, and the inner sum is over all quadruples of circuits

$$\big\{(\pi_1,\pi_2,\pi_3,\pi_4) \in \Pi(w_1,w_2,w_3,w_4) : \pi_i(0) \ne \pi_i(2), \ldots, \pi_i(2h-2) \ne \pi_i(2h),\ i = 1,2,3,4\big\}.$$

Note that, by the definition of $\kappa$, we have $\kappa \le 4h$ for any jointly matched quadruple of words $(w_1,w_2,w_3,w_4)$ of total length $8h$. Fix $w_1, w_2, w_3, w_4$, jointly matched and cross-matched.

Case I: $\kappa = 4h$ or $4h-1$. Then the maximum power with which any $\hat{x}_\alpha$ can occur in $\prod_{i=1}^{4}\hat{X}_{\pi_i}$ is bounded by 4. But since $\sup_\alpha E(\hat{x}_\alpha)^4 < \infty$, we immediately have $\sup E\big|\prod_{i=1}^{4}(\hat{X}_{\pi_i} - E\hat{X}_{\pi_i})\big| < \infty$. Thus, by Lemma 4,

$$\frac{1}{p^{4+2h}\,n^{2h}} \sum_{\kappa \in \{4h,\,4h-1\}} \Big|E\prod_{i=1}^{4}\big(\hat{X}_{\pi_i} - E\hat{X}_{\pi_i}\big)\Big| = \frac{1}{p^{4+2h}\,n^{2h}}\,O\big(p^{2+2h}\,n^{2h}\big) = O(p^{-2}).$$

Case II. Suppose now that $\kappa = 4h - k$, $k \ge 2$. Borrowing notation from Lemma 3, we have

$$C_2 + C_3 + \cdots + C_{8h} = \kappa \quad \text{and} \quad 2C_2 + 3C_3 + \cdots + 8h\,C_{8h} = 8h.$$

These two equations immediately give

$$C_3 + 2C_4 + \cdots + (8h-2)\,C_{8h} = 8h - 2\kappa = 2k.$$

It is also easy to see that

$$C_{2k+2+i} = 0 \ \ \forall\, i \ge 1 \quad \text{and} \quad C_5 + 2C_6 + \cdots + (2k-2)\,C_{2k+2} \le 2k-2.$$

Since the input variables are truncated at $\epsilon_p n^{1/4}$ and since $\sup_\alpha E(x_\alpha^4) < \infty$,

$$\sup_\alpha E\Big|\prod_{i=1}^{4}\hat{X}_{\pi_i}\Big| \le C\, n^{\frac{C_5 + 2C_6 + \cdots + (2k-2)C_{2k+2}}{4}} = O\big(n^{\frac{2k-2}{4}}\big).$$

Thus, by Lemma 4, for any $k \ge 2$,

$$\frac{1}{p^{4+2h}\,n^{2h}} \sum_{\kappa = 4h-k} \Big|E\prod_{i=1}^{4}\big(\hat{X}_{\pi_i} - E\hat{X}_{\pi_i}\big)\Big| \le \frac{C_h\, p^{4+[\frac{4h-k+1}{2}]}\, n^{[\frac{4h-k}{2}]}}{p^{4+2h}\,n^{2h}}\,O\big(n^{\frac{2k-2}{4}}\big) = O\big(p^{-4-4h+(4+4h-k)+\frac{k-2}{2}}\big) = O\big(p^{-(\frac{k}{2}+1)}\big).$$

For a quick explanation of the first equality above, just note that the total power of $n$ in the previous expression is negative, so that $n$ may be replaced by $p$ throughout (eventually $n \ge p$). Since $k \ge 2$, all these bounds are summable in $p$, which completes the proof. $\square$

We are now ready to summarize the results of Theorem 7 and Lemmas 2-5 in the following theorem for Regime II.

Theorem 8. Let $p, n \to \infty$ so that $p/n \to 0$ and Assumption R2 holds. Suppose $L_p$ satisfies Assumptions A(ii), A$'$ and B, and suppose $\{\epsilon_p\}$, satisfying $\epsilon_p \to 0$ and $\epsilon_p p^{1/4} \to \infty$, is appropriately chosen. Then:

(i) For every even $h$,

$$\frac{1}{p}\,E\,\mathrm{Tr}\,B_p^h - \sum_{\substack{w \text{ matched} \\ |w| = h}} \frac{1}{p^{1+h/2}\,n^{h/2}}\,\big|\Pi^{*\prime}(w)\big| \to 0.$$

(ii) For every odd $h$, $\lim_p \frac{1}{p}\,E\,\mathrm{Tr}\,B_p^h = 0$.

(iii) For each $h$, $\frac{1}{p}\,\mathrm{Tr}\,B_p^h - E\frac{1}{p}\,\mathrm{Tr}\,B_p^h \to 0$ a.s.

(iv) $\beta_{2h} := \limsup_p \sum_{\substack{w \text{ matched},\ |w| = 2h}} \frac{1}{p^{1+h}\,n^{h}}\,\big|\Pi^{*\prime}(w)\big|$ satisfies Carleman's condition.

As a consequence, the sequence $\{F^{N_p}\}$ is almost surely tight. Every subsequential limit is symmetric and sub-Gaussian. The LSD of $\{F^{N_p}\}$ exists almost surely iff $\lim_p \sum_{w \text{ matched},\ |w| = 2h}\, p^{-(1+h)} n^{-h}\,|\Pi^{*\prime}(w)|$ exists for each $h$; these limits give the $2h$-th moments of the LSD.

Below we will deal with two matrices with different link functions $L^{(1)}$ and $L^{(2)}$. The corresponding relevant quantities will be denoted with added superscripts $(1)$ and $(2)$ respectively.

Theorem 9. Let $L^{(1)}$ and $L^{(2)}$ be two link functions satisfying Assumptions A(ii), A$'$ and B which agree on the set $\{(i,j) : 1 \le i \le p,\ 1 \le j \le n,\ i < j\}$. Then, for each matched word $w$ of length $4h$ with $|w| = 2h$,

$$\frac{1}{p^{h+1}\,n^{h}}\Big(\big|\Pi^{*\prime(1)}(w)\big| - \big|\Pi^{*\prime(2)}(w)\big|\Big) \to 0.$$

Hence, in Regime II under Assumption R2, $F^{N_p^{(i)}}$, $i = 1, 2$, have identical asymptotic behaviour.

Proof of Theorem 9. Define, for each link function $L^{(i)}$, $i = 1, 2$,

$$\Gamma_j^{(i)} := \big\{\pi \in \Pi^{*\prime(i)}(w) : \pi(2j-1) \le p\big\}, \quad j = 1, 2, \ldots, 2h.$$

Now, it is enough to prove that for each fixed $j$,

$$\frac{1}{p^{h+1}\,n^{h}}\,\big|\Gamma_j^{(i)}\big| \to 0.$$

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

More information

Inner Product Spaces

Inner Product Spaces Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

Solution to Homework 2

Solution to Homework 2 Solution to Homework 2 Olena Bormashenko September 23, 2011 Section 1.4: 1(a)(b)(i)(k), 4, 5, 14; Section 1.5: 1(a)(b)(c)(d)(e)(n), 2(a)(c), 13, 16, 17, 18, 27 Section 1.4 1. Compute the following, if

More information

Similarity and Diagonalization. Similar Matrices

Similarity and Diagonalization. Similar Matrices MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

More information

Metric Spaces. Chapter 7. 7.1. Metrics

Metric Spaces. Chapter 7. 7.1. Metrics Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2014 Timo Koski () Mathematisk statistik 24.09.2014 1 / 75 Learning outcomes Random vectors, mean vector, covariance

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Introduction to Matrix Algebra

Introduction to Matrix Algebra Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

More information

STAT 830 Convergence in Distribution

STAT 830 Convergence in Distribution STAT 830 Convergence in Distribution Richard Lockhart Simon Fraser University STAT 830 Fall 2011 Richard Lockhart (Simon Fraser University) STAT 830 Convergence in Distribution STAT 830 Fall 2011 1 / 31

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

More information

A characterization of trace zero symmetric nonnegative 5x5 matrices

A characterization of trace zero symmetric nonnegative 5x5 matrices A characterization of trace zero symmetric nonnegative 5x5 matrices Oren Spector June 1, 009 Abstract The problem of determining necessary and sufficient conditions for a set of real numbers to be the

More information

BANACH AND HILBERT SPACE REVIEW

BANACH AND HILBERT SPACE REVIEW BANACH AND HILBET SPACE EVIEW CHISTOPHE HEIL These notes will briefly review some basic concepts related to the theory of Banach and Hilbert spaces. We are not trying to give a complete development, but

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

MATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform

MATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform MATH 433/533, Fourier Analysis Section 11, The Discrete Fourier Transform Now, instead of considering functions defined on a continuous domain, like the interval [, 1) or the whole real line R, we wish

More information

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

More information

Quadratic forms Cochran s theorem, degrees of freedom, and all that

Quadratic forms Cochran s theorem, degrees of freedom, and all that Quadratic forms Cochran s theorem, degrees of freedom, and all that Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 1, Slide 1 Why We Care Cochran s theorem tells us

More information

1 Prior Probability and Posterior Probability

1 Prior Probability and Posterior Probability Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

More information

1 Introduction to Matrices

1 Introduction to Matrices 1 Introduction to Matrices In this section, important definitions and results from matrix algebra that are useful in regression analysis are introduced. While all statements below regarding the columns

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

Notes on metric spaces

Notes on metric spaces Notes on metric spaces 1 Introduction The purpose of these notes is to quickly review some of the basic concepts from Real Analysis, Metric Spaces and some related results that will be used in this course.

More information

4. How many integers between 2004 and 4002 are perfect squares?

4. How many integers between 2004 and 4002 are perfect squares? 5 is 0% of what number? What is the value of + 3 4 + 99 00? (alternating signs) 3 A frog is at the bottom of a well 0 feet deep It climbs up 3 feet every day, but slides back feet each night If it started

More information

Chapter 17. Orthogonal Matrices and Symmetries of Space

Chapter 17. Orthogonal Matrices and Symmetries of Space Chapter 17. Orthogonal Matrices and Symmetries of Space Take a random matrix, say 1 3 A = 4 5 6, 7 8 9 and compare the lengths of e 1 and Ae 1. The vector e 1 has length 1, while Ae 1 = (1, 4, 7) has length

More information

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

More information

1 Sufficient statistics

1 Sufficient statistics 1 Sufficient statistics A statistic is a function T = rx 1, X 2,, X n of the random sample X 1, X 2,, X n. Examples are X n = 1 n s 2 = = X i, 1 n 1 the sample mean X i X n 2, the sample variance T 1 =

More information

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH ZACHARY ABEL 1. Introduction In this survey we discuss properties of the Higman-Sims graph, which has 100 vertices, 1100 edges, and is 22 regular. In fact

More information

Gaussian Conjugate Prior Cheat Sheet

Gaussian Conjugate Prior Cheat Sheet Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian

More information

1 if 1 x 0 1 if 0 x 1

1 if 1 x 0 1 if 0 x 1 Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

More information

Multivariate normal distribution and testing for means (see MKB Ch 3)

Multivariate normal distribution and testing for means (see MKB Ch 3) Multivariate normal distribution and testing for means (see MKB Ch 3) Where are we going? 2 One-sample t-test (univariate).................................................. 3 Two-sample t-test (univariate).................................................

More information

The Bivariate Normal Distribution

The Bivariate Normal Distribution The Bivariate Normal Distribution This is Section 4.7 of the st edition (2002) of the book Introduction to Probability, by D. P. Bertsekas and J. N. Tsitsiklis. The material in this section was not included

More information

Notes on Determinant

Notes on Determinant ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

More information

Section 6.1 - Inner Products and Norms

Section 6.1 - Inner Products and Norms Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,

More information

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

Continuity of the Perron Root

Continuity of the Perron Root Linear and Multilinear Algebra http://dx.doi.org/10.1080/03081087.2014.934233 ArXiv: 1407.7564 (http://arxiv.org/abs/1407.7564) Continuity of the Perron Root Carl D. Meyer Department of Mathematics, North

More information

n k=1 k=0 1/k! = e. Example 6.4. The series 1/k 2 converges in R. Indeed, if s n = n then k=1 1/k, then s 2n s n = 1 n + 1 +...

n k=1 k=0 1/k! = e. Example 6.4. The series 1/k 2 converges in R. Indeed, if s n = n then k=1 1/k, then s 2n s n = 1 n + 1 +... 6 Series We call a normed space (X, ) a Banach space provided that every Cauchy sequence (x n ) in X converges. For example, R with the norm = is an example of Banach space. Now let (x n ) be a sequence

More information

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013 Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,

More information

0.1 Phase Estimation Technique

0.1 Phase Estimation Technique Phase Estimation In this lecture we will describe Kitaev s phase estimation algorithm, and use it to obtain an alternate derivation of a quantum factoring algorithm We will also use this technique to design

More information

1 The Brownian bridge construction

1 The Brownian bridge construction The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

More information

Linear Algebra I. Ronald van Luijk, 2012

Linear Algebra I. Ronald van Luijk, 2012 Linear Algebra I Ronald van Luijk, 2012 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents 1. Vector spaces 3 1.1. Examples 3 1.2. Fields 4 1.3. The field of complex numbers. 6 1.4.

More information

Lecture L3 - Vectors, Matrices and Coordinate Transformations

Lecture L3 - Vectors, Matrices and Coordinate Transformations S. Widnall 16.07 Dynamics Fall 2009 Lecture notes based on J. Peraire Version 2.0 Lecture L3 - Vectors, Matrices and Coordinate Transformations By using vectors and defining appropriate operations between

More information

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively. Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry

More information

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Some stability results of parameter identification in a jump diffusion model

Some stability results of parameter identification in a jump diffusion model Some stability results of parameter identification in a jump diffusion model D. Düvelmeyer Technische Universität Chemnitz, Fakultät für Mathematik, 09107 Chemnitz, Germany Abstract In this paper we discuss

More information

Some probability and statistics

Some probability and statistics Appendix A Some probability and statistics A Probabilities, random variables and their distribution We summarize a few of the basic concepts of random variables, usually denoted by capital letters, X,Y,

More information

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix Linear Algebra & Properties of the Covariance Matrix October 3rd, 2012 Estimation of r and C Let rn 1, rn, t..., rn T be the historical return rates on the n th asset. rn 1 rṇ 2 r n =. r T n n = 1, 2,...,

More information

Lecture 13: Martingales

Lecture 13: Martingales Lecture 13: Martingales 1. Definition of a Martingale 1.1 Filtrations 1.2 Definition of a martingale and its basic properties 1.3 Sums of independent random variables and related models 1.4 Products of

More information

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

More information

Lecture 5 Principal Minors and the Hessian

Lecture 5 Principal Minors and the Hessian Lecture 5 Principal Minors and the Hessian Eivind Eriksen BI Norwegian School of Management Department of Economics October 01, 2010 Eivind Eriksen (BI Dept of Economics) Lecture 5 Principal Minors and

More information

Tiers, Preference Similarity, and the Limits on Stable Partners

Tiers, Preference Similarity, and the Limits on Stable Partners Tiers, Preference Similarity, and the Limits on Stable Partners KANDORI, Michihiro, KOJIMA, Fuhito, and YASUDA, Yosuke February 7, 2010 Preliminary and incomplete. Do not circulate. Abstract We consider

More information

Triangle deletion. Ernie Croot. February 3, 2010

Triangle deletion. Ernie Croot. February 3, 2010 Triangle deletion Ernie Croot February 3, 2010 1 Introduction The purpose of this note is to give an intuitive outline of the triangle deletion theorem of Ruzsa and Szemerédi, which says that if G = (V,

More information

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators... MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

More information

Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

More information

Multi-variable Calculus and Optimization

Multi-variable Calculus and Optimization Multi-variable Calculus and Optimization Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Multi-variable Calculus and Optimization 1 / 51 EC2040 Topic 3 - Multi-variable Calculus

More information

NOTES ON LINEAR TRANSFORMATIONS

NOTES ON LINEAR TRANSFORMATIONS NOTES ON LINEAR TRANSFORMATIONS Definition 1. Let V and W be vector spaces. A function T : V W is a linear transformation from V to W if the following two properties hold. i T v + v = T v + T v for all

More information

Chapter 3. Cartesian Products and Relations. 3.1 Cartesian Products

Chapter 3. Cartesian Products and Relations. 3.1 Cartesian Products Chapter 3 Cartesian Products and Relations The material in this chapter is the first real encounter with abstraction. Relations are very general thing they are a special type of subset. After introducing

More information

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS Eusebio GÓMEZ, Miguel A. GÓMEZ-VILLEGAS and J. Miguel MARÍN Abstract In this paper it is taken up a revision and characterization of the class of

More information

Lecture 4: BK inequality 27th August and 6th September, 2007

Lecture 4: BK inequality 27th August and 6th September, 2007 CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d). 1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

More information

Chapter 3. Distribution Problems. 3.1 The idea of a distribution. 3.1.1 The twenty-fold way

Chapter 3. Distribution Problems. 3.1 The idea of a distribution. 3.1.1 The twenty-fold way Chapter 3 Distribution Problems 3.1 The idea of a distribution Many of the problems we solved in Chapter 1 may be thought of as problems of distributing objects (such as pieces of fruit or ping-pong balls)

More information

Separation Properties for Locally Convex Cones

Separation Properties for Locally Convex Cones Journal of Convex Analysis Volume 9 (2002), No. 1, 301 307 Separation Properties for Locally Convex Cones Walter Roth Department of Mathematics, Universiti Brunei Darussalam, Gadong BE1410, Brunei Darussalam

More information

MULTIVARIATE PROBABILITY DISTRIBUTIONS

MULTIVARIATE PROBABILITY DISTRIBUTIONS MULTIVARIATE PROBABILITY DISTRIBUTIONS. PRELIMINARIES.. Example. Consider an experiment that consists of tossing a die and a coin at the same time. We can consider a number of random variables defined

More information

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 18. A Brief Introduction to Continuous Probability

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 18. A Brief Introduction to Continuous Probability CS 7 Discrete Mathematics and Probability Theory Fall 29 Satish Rao, David Tse Note 8 A Brief Introduction to Continuous Probability Up to now we have focused exclusively on discrete probability spaces

More information

Stationary random graphs on Z with prescribed iid degrees and finite mean connections

Stationary random graphs on Z with prescribed iid degrees and finite mean connections Stationary random graphs on Z with prescribed iid degrees and finite mean connections Maria Deijfen Johan Jonasson February 2006 Abstract Let F be a probability distribution with support on the non-negative

More information

Bindel, Spring 2012 Intro to Scientific Computing (CS 3220) Week 3: Wednesday, Feb 8

Bindel, Spring 2012 Intro to Scientific Computing (CS 3220) Week 3: Wednesday, Feb 8 Spaces and bases Week 3: Wednesday, Feb 8 I have two favorite vector spaces 1 : R n and the space P d of polynomials of degree at most d. For R n, we have a canonical basis: R n = span{e 1, e 2,..., e

More information

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like

More information

Math 55: Discrete Mathematics

Math 55: Discrete Mathematics Math 55: Discrete Mathematics UC Berkeley, Fall 2011 Homework # 5, due Wednesday, February 22 5.1.4 Let P (n) be the statement that 1 3 + 2 3 + + n 3 = (n(n + 1)/2) 2 for the positive integer n. a) What

More information

Lecture 13 - Basic Number Theory.

Lecture 13 - Basic Number Theory. Lecture 13 - Basic Number Theory. Boaz Barak March 22, 2010 Divisibility and primes Unless mentioned otherwise throughout this lecture all numbers are non-negative integers. We say that A divides B, denoted

More information

Master s Theory Exam Spring 2006

Master s Theory Exam Spring 2006 Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

More information

A Uniform Asymptotic Estimate for Discounted Aggregate Claims with Subexponential Tails

A Uniform Asymptotic Estimate for Discounted Aggregate Claims with Subexponential Tails 12th International Congress on Insurance: Mathematics and Economics July 16-18, 2008 A Uniform Asymptotic Estimate for Discounted Aggregate Claims with Subexponential Tails XUEMIAO HAO (Based on a joint

More information

9.2 Summation Notation

9.2 Summation Notation 9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a

More information

1 The Line vs Point Test

1 The Line vs Point Test 6.875 PCP and Hardness of Approximation MIT, Fall 2010 Lecture 5: Low Degree Testing Lecturer: Dana Moshkovitz Scribe: Gregory Minton and Dana Moshkovitz Having seen a probabilistic verifier for linearity

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

ON SOME ANALOGUE OF THE GENERALIZED ALLOCATION SCHEME

ON SOME ANALOGUE OF THE GENERALIZED ALLOCATION SCHEME ON SOME ANALOGUE OF THE GENERALIZED ALLOCATION SCHEME Alexey Chuprunov Kazan State University, Russia István Fazekas University of Debrecen, Hungary 2012 Kolchin s generalized allocation scheme A law of

More information

1 Norms and Vector Spaces

1 Norms and Vector Spaces 008.10.07.01 1 Norms and Vector Spaces Suppose we have a complex vector space V. A norm is a function f : V R which satisfies (i) f(x) 0 for all x V (ii) f(x + y) f(x) + f(y) for all x,y V (iii) f(λx)

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

The Characteristic Polynomial

The Characteristic Polynomial Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

More information

The Matrix Elements of a 3 3 Orthogonal Matrix Revisited

The Matrix Elements of a 3 3 Orthogonal Matrix Revisited Physics 116A Winter 2011 The Matrix Elements of a 3 3 Orthogonal Matrix Revisited 1. Introduction In a class handout entitled, Three-Dimensional Proper and Improper Rotation Matrices, I provided a derivation

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics

No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results

More information

THREE DIMENSIONAL GEOMETRY

THREE DIMENSIONAL GEOMETRY Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,

More information

minimal polyonomial Example

minimal polyonomial Example Minimal Polynomials Definition Let α be an element in GF(p e ). We call the monic polynomial of smallest degree which has coefficients in GF(p) and α as a root, the minimal polyonomial of α. Example: We

More information

Spectra of Sample Covariance Matrices for Multiple Time Series

Spectra of Sample Covariance Matrices for Multiple Time Series Spectra of Sample Covariance Matrices for Multiple Time Series Reimer Kühn, Peter Sollich Disordered System Group, Department of Mathematics, King s College London VIIIth Brunel-Bielefeld Workshop on Random

More information

I. GROUPS: BASIC DEFINITIONS AND EXAMPLES

I. GROUPS: BASIC DEFINITIONS AND EXAMPLES I GROUPS: BASIC DEFINITIONS AND EXAMPLES Definition 1: An operation on a set G is a function : G G G Definition 2: A group is a set G which is equipped with an operation and a special element e G, called

More information

Mathematical Induction

Mathematical Induction Mathematical Induction (Handout March 8, 01) The Principle of Mathematical Induction provides a means to prove infinitely many statements all at once The principle is logical rather than strictly mathematical,

More information

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015. Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

More information