Ergodic Theory and Expanding Maps on the Circle


University of Helsinki
Bachelor's thesis

Ergodic Theory and Expanding Maps on the Circle

Author: Henri Sulku (henri.sulku@helsinki.fi)
Supervisor: Dr. Mikko Stenlund
September 17, 2012

Contents

1 Introduction
2 Basics of ergodic theory
  2.1 Measure preserving transformations
  2.2 Ergodicity
  2.3 Ergodic theorems
  2.4 Mixing
3 Coupling in probability theory
4 Uniformly expanding maps on the circle
  4.1 Preliminaries
    4.1.1 Expansion properties and inverse branches
    4.1.2 Transfer operator
  4.2 Existence of a.c.i.m.
    4.2.1 Boundedness
    4.2.2 Compactness
  4.3 Exponential decay of correlation coefficients
    4.3.1 Preliminary results
    4.3.2 Coupling
    4.3.3 Mixing
5 Conclusions
A Appendix

1 Introduction

Ergodic theory is a branch of mathematics that studies the asymptotic and statistical properties of dynamical systems that are allowed to evolve for a sufficiently long time. The term was introduced by Ludwig Boltzmann (1844–1906), who also had a great influence on the early development of the theory. In fact, one can say that initially ergodic theory was nothing but the study of the ergodic hypothesis and the ergodic problem, both originating from Boltzmann's works [1]. Since a major part of the research on ergodic theory has always been influenced by statistical physics, we review the very basics of classical mechanics.

From the viewpoint of classical mechanics, the state of a system consisting of N particles is completely determined by the positions and momenta of the particles. In a 3-dimensional world there are 6N defining parameters (every particle has 3 position coordinates and 3 momentum coordinates, and there are N particles), called degrees of freedom. Therefore, every state can be represented as a vector in a 6N-dimensional Euclidean space X called the phase space or configuration space. The descriptive word "dynamical" means that there is some set of equations, called equations of motion, which determines how the system evolves in time. When these equations of motion are known, it is possible to attach a mapping T_t : X → X to every t ∈ R₊ such that T_t(x) is the state of the system at time t if it was at the state x at time 0. (For simplicity, we assume that the equations of motion have a unique solution for all times t and all initial conditions x.) The trajectory of an initial state x is therefore {T_t(x) : t ∈ R₊}. The problem is that usually it is nearly impossible to solve these equations of motion, and even if they could be solved, the mappings T_t would be too unwieldy to work with. Therefore, we need a simplified model of the dynamics.

If the governing equations of motion do not change in time, it is evident that T_t ∘ T_s = T_{t+s} holds for every t, s ∈ R₊. By iterating this property, we deduce that T_{n t₀} = T_{t₀}ⁿ = T_{t₀} ∘ T_{t₀} ∘ ⋯ ∘ T_{t₀} for every natural number n. Therefore, we can simplify the dynamics by discretizing time. Presumably, T_{t₀}ⁿ and T_t have similar asymptotic characteristics. Thus, we may assume that the system evolves in discrete time steps. Furthermore, the evolution of the system is obtained by iterating the single mapping T := T_{t₀}. If the system is now at state x, it will be at state T(x) at the next moment and at Tⁿ(x) after n time steps. The dynamics have also been simplified, since we no longer have to work with a family of mappings.

So far we have described how a single state evolves in time. However, the more interesting scenario is the one in which we are given a probability distribution µ of states. This is also a somewhat more realistic picture, since the initial state is never known with full certainty. If the system is in dynamical equilibrium, we expect that the distribution does not vary in time. Therefore, we are particularly interested in mappings which satisfy µ(A) = µ(T⁻¹A), i.e., mappings which preserve the measure µ. Roughly speaking, this is the basic setup from which ergodic theory arises. In this framework, Boltzmann also conjectured the following [2, p. 78]:

If a system is sufficiently large, then the average of an observable f (an observable is a measurable physical quantity, pressure for example) along the trajectory of the system approaches the expected value of f with respect to the equilibrium distribution µ.

Expressed in modern mathematical language, the conjecture states that for every initial state x

    lim_{T→∞} (1/T) ∫₀^T f(T_t(x)) dt = ∫ f dµ,   (1)

or in discrete time:

    lim_{N→∞} (1/N) Σ_{n=0}^{N−1} f(Tⁿ(x)) = ∫ f dµ.   (2)

This is, of course, a nontrivial statement since, for example, the left-hand side seems to depend on the initial state x unlike the right-hand side. The justification of these equalities is the ergodic problem mentioned at the beginning, and the study of these equalities dominated the field until the 1930s. It was soon realised that not every initial state satisfies the claim — those with a periodic trajectory, for example. In order to fix this incongruity, the prevailing opinion among leading researchers was that a sufficiently large physical system cannot stay in a periodic orbit, because the motion is disturbed by external forces and irregular collisions of the particles. These considerations led to the acceptance of the following ergodic hypothesis [2, pp. 77, 80].

Ergodic hypothesis 1. Apart from some negligible set of initial states, the trajectory of a physical system visits all states compatible with the total energy of its initial state.

The modern form of the hypothesis stems from Henri Poincaré (1854–1912) [2, p. 83]:

Ergodic hypothesis 2. Apart from a set of states that has zero probability, the trajectory of a physical system is dense in the set of all states compatible with the total energy of the initial state.

The idea behind this hypothesis, that systems can wander freely in the allowed part of the phase space, is the cornerstone of ergodicity, and this, combined with the measure preserving property of the dynamics, results in highly nontrivial consequences. The work of George Birkhoff (1884–1944) and John von Neumann (1903–1957) finally concluded the study of the ergodic problem in the early 1930s, giving a positive answer to Boltzmann's conjecture. However, this was not the completion of ergodic theory but just the opposite. Since the 1930s ergodic theory has developed considerably, giving rise to more sophisticated and detailed concepts of measure preserving dynamical systems. Nowadays, ergodic theory is a major branch of the study of a wide range of dynamical systems as well as stochastic processes. Many of the applications are naturally related to mathematical physics, but there are also examples in which ergodic theory is applied to purer mathematics such as number theory and functional analysis.

Structure of the Thesis. The rest of the treatise consists of two parts. In Section 2, we present the basic concepts and core results of the ergodic theory of measure preserving systems.

The results presented here belong to the classical era of the theory; even the most recent results are from the 1930s. Therefore, this presentation does not give a good picture of the modern state of the field. The relevant concepts and results that are needed in the latter part of the treatise are presented and proved here. Thus, readers not familiar with the theory can also get acquainted with the latter, more technical part. In Section 3, we present a concise introduction to the usage of the coupling method. Coupling, arising from the theory of Markov processes, is a more dynamical alternative to the commonly used functional analytic techniques for establishing convergence results.

In the latter part, that is, in Section 4, we analyse a class of widely studied dynamical systems, namely expanding circle mappings, which possess typical properties of measure preserving systems. The main results we achieve for C² mappings are the existence and uniqueness of a mixing measure equivalent to the Lebesgue measure, and the exponential convergence of Hölder continuous densities towards the invariant density (which is actually Lipschitz continuous). The exponential convergence is obtained using the coupling method introduced in Section 3. Moreover, we prove that mixing follows from the convergence of densities and that it is exponential for Hölder continuous observables.

2 Basics of ergodic theory

Some aspects of ergodic theory can be generalised to infinite measure spaces, but in this chapter we focus on spaces of finite measure. In particular, it is customary to restrict oneself to probability spaces, since then one can exploit the tools and intuition of probability theory. Furthermore, working with probability spaces leads to very natural probabilistic interpretations of the results. This is, of course, only a minor restriction, since a finite measure can always be scaled to a probability measure: µ ↦ µ/µ(X). A more comprehensive introduction to ergodic theory can be found in [3] or [4].

2.1 Measure preserving transformations

As we mentioned in the Introduction, the basic setup of ergodic theory is the measure preserving transformation (mpt), or measure preserving dynamical system, which is formalised in the following definition.

Definition 2.1 (Measure preserving transformation). A measure preserving transformation is a quartet (X, Γ, µ, T), where (X, Γ, µ) is a measure space and T : X → X is a mapping such that

1. T is (Γ, Γ)-measurable: A ∈ Γ ⟹ T⁻¹A ∈ Γ.
2. µ is T-invariant: µ(A) = µ(T⁻¹A) for all A ∈ Γ.

An mpt is called a probability preserving transformation (ppt) if (X, Γ, µ) is a probability space. Additionally, it is worth noticing that the equality µ(T⁻ⁿA) = µ(A) is obtained by iterating the second property in the definition of an mpt.

A common problem in the theory of measure preserving dynamical systems is to prove the existence and uniqueness of a T-invariant measure when the measurable sets Γ and the dynamics T are given in advance. In order to achieve uniqueness, one usually demands that some additional requirement, such as absolute continuity with respect to some other measure, be fulfilled.

In the language of mechanics, this means that we wish to prove the existence of a stationary distribution of the desired type when we know the dynamics as well as the measurable events. From the viewpoint of physics, this is indeed the central concern. Therefore, it is worthwhile to characterise the invariance of a measure in such a way that we can determine whether or not a given measure is suitable.

Theorem 2.1. Let X, T and Γ be given. The following are equivalent:

1. µ is T-invariant.
2. ∫ f dµ = ∫ f∘T dµ for all f ∈ L¹(X, Γ, µ).
3. ∫ f dµ = ∫ f∘T dµ for all f ∈ L^p(X, Γ, µ), for some p ≥ 1.
4. ∫ f dµ = ∫ f∘T dµ for all f ∈ C(X), if Γ = B = the Borel σ-algebra.

Proof. We prove only the equivalence of 1 and 2, since the rest of the theorem follows by similar or simple approximation arguments. In order to prove the direction 1 ⟹ 2, let χ_A be the indicator of a measurable set A. Then

    ∫ χ_A dµ = µ(A) = µ(T⁻¹A) = ∫ χ_{T⁻¹A} dµ = ∫ χ_A∘T dµ

and the claim holds for indicators. Therefore, the claim holds for simple functions by linearity. If f ∈ L¹ is positive, there is an increasing sequence (f_n)_n of simple functions such that f_n ↑ f. By direct calculation f_n∘T ↑ f∘T, and

    ∫ f dµ = lim_n ∫ f_n dµ = lim_n ∫ f_n∘T dµ = ∫ f∘T dµ

holds by the monotone convergence theorem. For general f ∈ L¹, the claim follows by applying the preceding argument to the positive and negative parts separately. The opposite direction is obtained easily, since χ_A ∈ L¹ for every A ∈ Γ by the finiteness of the measure space and

    µ(A) = ∫ χ_A dµ = ∫ χ_A∘T dµ = ∫ χ_{T⁻¹A} dµ = µ(T⁻¹A)

holds by assumption.

Remark 2.1. Minor modifications in the proof show that the mapping f ↦ f∘T is an isometry from L^p into itself if T is measure preserving.

In order to illustrate the use of the theorem, let us consider the canonical example of circle rotations.

Example 2.1. Let X be the unit circle, i.e., [0, 1] with endpoints identified, let B be the Borel σ-algebra and define T_α : X → X by T_α(x) = α + x mod 1 for all x ∈ [0, 1], where α ∈ [0, 1). We claim that the uniform measure (the arc length measure), for which

    µ(I) = length of I   (3)

holds for every interval I, is T_α-invariant. (Note that arcs and their unions form an algebra which generates the Borel σ-algebra of the circle; therefore, equation (3) defines a unique measure on the Borel σ-algebra by Carathéodory's extension theorem.) We show that condition 3 of the previous theorem holds with p = 2. To this end, let f ∈ L² be arbitrary. Using the fact that f equals its Fourier series (in the L² sense) we have

    f(x) = Σ_{n=−∞}^{∞} a_n e^{2πinx}.

By Remark 2.1, this holds also for f∘T_α and we have

    f∘T_α(x) = Σ_{n=−∞}^{∞} a_n e^{2πin(x+α)} =: Σ_{n=−∞}^{∞} b_n e^{2πinx}.

Comparing the constant terms, we see that a₀ = b₀, i.e., ∫ f dµ = ∫ f∘T_α dµ, which finishes the proof.
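The invariance claimed in Example 2.1 is easy to check numerically. The following is a minimal sketch (Python; the test function f and the value of α are arbitrary choices, not taken from the thesis) that compares ∫ f dm with ∫ f∘T_α dm on a fine grid, in the spirit of condition 3 of Theorem 2.1.

```python
import numpy as np

# Numerical illustration of Theorem 2.1 for the rotation T_alpha(x) = x + alpha mod 1:
# the integrals of f and f∘T_alpha against the arc length measure agree.
alpha = 0.3137
f = lambda x: np.cos(2 * np.pi * x) ** 2 + 0.5 * np.sin(6 * np.pi * x)

x = np.linspace(0.0, 1.0, 200_001)           # fine grid on the circle
int_f  = np.trapz(f(x), x)                   # ∫ f dm
int_fT = np.trapz(f((x + alpha) % 1.0), x)   # ∫ f∘T_alpha dm

print(int_f, int_fT)   # the two values agree up to discretisation error
```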
The probability preserving nature of the dynamics carries very useful information. In fact, Definition 2.1 of a ppt already yields the first nontrivial ergodic theorem, namely the highly celebrated Poincaré recurrence theorem. In particular, the theorem of Poincaré showed that the definition of a measure preserving system is a good basis for the study of dynamical systems. Besides, the conclusion is physically, as well as philosophically, very fascinating.

Theorem 2.2 (Poincaré recurrence theorem). Let (X, Γ, µ, T) be a ppt and A ∈ Γ. For almost every x ∈ A there is n ∈ N such that Tⁿ(x) ∈ A.

Proof. Let F = {x ∈ A : Tⁿ(x) ∉ A for all n ∈ N}; it suffices to show that µ(F) = 0. First, T⁻ⁿF ∩ T⁻ᵐF = ∅ when n ≠ m. If this were not the case and z ∈ T⁻ⁿF ∩ T⁻ᵐF, then (assuming n > m) Tᵐ(z) ∈ F and T^{n−m}(Tᵐ(z)) = Tⁿ(z) ∈ F ⊂ A, which is a contradiction by the definition of the set F. By the σ-additivity and invariance of the measure,

    Σ_{n=0}^{∞} µ(F) = Σ_{n=0}^{∞} µ(T⁻ⁿF) = µ( ⋃_{n=0}^{∞} T⁻ⁿF ) ≤ µ(X) = 1 < ∞.

The proof is finished by noticing that this is possible if and only if µ(F) = 0.

Actually, it is not too difficult to strengthen the result so that Tⁿ(x) ∈ A for infinitely many n. However, the remarkable conclusion is apparent even in the weaker form: the system returns arbitrarily close to its initial state if it is allowed to run for a long time. This is indeed an unintuitive and even paradoxical result if not analysed further. This time, however, the rigorous solution coincides with the natural proposition: for neighbourhoods of extremely small probability the return time becomes extremely long, so long that a return might never be observed in the real world. In order to prove this, we need to introduce the concept of ergodicity.

2.2 Ergodicity

As motivated in the Introduction, ergodicity is based on the idea that a system can wander freely in the governing phase space. The following definition expresses this by means of measure theory. We say that a set E is invariant if T⁻¹E = E.

Definition 2.2. An mpt (X, Γ, µ, T) is ergodic if µ(E) = 0 or µ(E^c) = 0 holds for every invariant set E.

In metric spaces, or more generally in second countable topological spaces, ergodicity of a measure that is positive on nonempty open sets implies that almost all orbits are dense [5, p. 26]. In a sense, we have therefore constructed a measure theoretical counterpart of the topological requirement in the ergodic hypothesis. Usually, when T (or µ) is given, one loosely says that µ (or T) is ergodic, even though ergodicity is a property of the whole system (X, Γ, µ, T). More characterisations are obtained using ergodic theorems, but we present one which follows straightforwardly from the definition. Invariant functions are those for which f(x) = f∘T(x) for almost every x.

Theorem 2.3. An mpt (X, Γ, µ, T) is ergodic if and only if every T-invariant L²-function is constant µ-almost everywhere.

Proof. Assume first that the system is ergodic and that f∘T = f. The sets A_t = {x : f(x) > t} are invariant due to the invariance of f, so ergodicity forces µ(A_t) ∈ {0, 1} for every t ∈ R. Since t ↦ µ(A_t) is non-increasing, with limit 1 as t → −∞ and 0 as t → ∞, the number c = sup{t ∈ R : µ(A_t) = 1} is finite, µ(A_t) = 1 for all t < c and µ(A_t) = 0 for all t ≥ c. This means precisely that f = c µ-almost everywhere; indeed, if f were not almost surely constant, there would be some t with µ(A_t) ∈ (0, 1), contradicting ergodicity.

Conversely, if A is an invariant set, its characteristic function satisfies χ_A(T(x)) = χ_{T⁻¹A}(x) = χ_A(x). If invariant functions are constant, then χ_A = 0 µ-a.e. or χ_A = 1 µ-a.e., and therefore µ(A) = 0 or µ(A) = 1. Since A was an arbitrary invariant set, this proves ergodicity.

The usefulness of the previous theorem is illustrated in the following example, which shows that the dynamics of the previously studied circle rotations depend considerably on the rotation number α.

Example 2.2 (Sequel to Example 2.1). (X, B, µ, T_α) is ergodic if and only if α ∉ Q. Let us first assume that α ∈ Q, i.e., α = p/q for some p ∈ Z, q ∈ N. Define

    A_n = [ n/q, n/q + 1/(2q) ],   (4)

where 0 ≤ n ≤ q − 1. Obviously these sets are disjoint, and T⁻¹_{p/q} A_n = A_{(n−p) mod q}, so T⁻¹_{p/q} permutes the sets A_n. Therefore, A = ⋃_{n=0}^{q−1} A_n is invariant and

    µ(A) = Σ_{n=0}^{q−1} µ(A_n) = q · 1/(2q) = 1/2.

Consequently, T_{p/q} is not ergodic.

Let us then assume that α ∉ Q and that f ∈ L² is invariant. Being an L² function, f has a Fourier series representation, and invariance is therefore equivalent to

    Σ_{k=−∞}^{∞} a_k e^{2πikx} = f(x) = f(T_α(x)) = Σ_{k=−∞}^{∞} a_k e^{2πik(x+α)} = Σ_{k=−∞}^{∞} a_k e^{2πikα} e^{2πikx}.

Due to the uniqueness of the coefficients,

    a_k = a_k e^{2πikα} for all k ∈ Z.   (5)

Since α ∉ Q by assumption, e^{2πikα} ≠ 1 whenever k ≠ 0. Thus, a_k = a₀ δ_{0,k} and f ≡ a₀ is constant. Therefore, the system is ergodic by Theorem 2.3.
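Example 2.2 can be illustrated numerically through Birkhoff averages (formally introduced in Section 2.3): for an irrational rotation number the time averages of an observable settle near its space average regardless of the starting point, whereas for a rational rotation number they depend on the starting point. A minimal sketch, with the observable and the rotation numbers chosen arbitrarily for illustration:

```python
import numpy as np

# Birkhoff averages (1/N) sum_{n<N} f(T_alpha^n x0) for the circle rotation.
# The observable f(x) = cos(8*pi*x) has ∫ f dm = 0; for the irrational rotation
# the averages are close to 0 for every x0, while for alpha = 1/4 they depend
# on x0 (the orbit is periodic, so the system is not ergodic).
f = lambda x: np.cos(8 * np.pi * x)
N = 100_000

def birkhoff_average(alpha, x0):
    n = np.arange(N)
    orbit = (x0 + n * alpha) % 1.0            # T_alpha^n(x0) in closed form
    return f(orbit).mean()

for alpha in (0.25, np.sqrt(2) - 1.0):        # rational vs irrational
    print(alpha, [round(birkhoff_average(alpha, x0), 4) for x0 in (0.0, 0.1, 0.2)])
```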
We finish this section by proving the statement at the end of the previous one, i.e., that the return time is inversely proportional to the probability of a set. To this end, let us denote the first return time of a point x by n_A(x). More precisely, n_A : A → N ∪ {∞} is such that T^{n_A(x)}(x) ∈ A and Tⁿ(x) ∉ A for all 0 < n < n_A(x). Remark that n_A is almost everywhere finite by Poincaré's theorem. The proof of the theorem, as well as that of Poincaré's theorem, is essentially from the book of M. Yuri & M. Pollicott [3].

Theorem 2.4 (Kac's recurrence theorem). Let (X, Γ, µ, T) be an ergodic ppt and let A ∈ Γ with µ(A) > 0. The expected return time with respect to the induced probability measure µ_A (that is, µ_A(B) = µ(B)/µ(A) for every measurable B ⊂ A) is

    ∫_A n_A(x) dµ_A(x) = 1/µ(A).   (6)

Proof. By the definition of the induced probability measure, it is equivalent to show that ∫_A n_A dµ = 1. Let us start by defining the following sets.

1. A_n = {x ∈ A : n_A(x) = n} for each n ≥ 1 (and n = ∞). The sets A_n are disjoint and A = (⋃_{n∈N} A_n) ∪ A_∞. In particular, Σ_{n∈N} µ(A_n) = µ(A), since µ(A_∞) = 0 by Poincaré's theorem.

2. B_n = {x ∈ X : Tⁱ(x) ∉ A for 1 ≤ i ≤ n − 1, Tⁿ(x) ∈ A} for each n ≥ 1. The sets B_n are also disjoint, and by ergodicity Σ_n µ(B_n) = 1 (since ⋃_n B_n is an invariant set with positive measure).

We can write

    ∫_A n_A(x) dµ(x) = Σ_{k=1}^{∞} k µ(A_k) = Σ_{k=1}^{∞} Σ_{n=k}^{∞} µ(A_n)

by Fubini's theorem. Therefore, if we can show that Σ_{n=k}^{∞} µ(A_n) = µ(B_k), this will complete the proof, since then ∫_A n_A dµ = Σ_{k=1}^{∞} µ(B_k) = 1. When k = 1, B₁ = T⁻¹A by definition. Thus,

    Σ_{n=1}^{∞} µ(A_n) = µ(A) = µ(T⁻¹A) = µ(B₁)

and the claim holds for k = 1. For k > 1 we proceed by induction. Using the definitions of the sets B_j and A_j, we can partition

    T⁻¹B_k = ( T⁻¹B_k ∩ T⁻¹A ) ∪ ( T⁻¹B_k ∩ (X \ T⁻¹A) ) = T⁻¹A_k ∪ B_{k+1}.

Therefore,

    µ(B_k) = µ(T⁻¹B_k) = µ(B_{k+1}) + µ(T⁻¹A_k) = µ(B_{k+1}) + µ(A_k).

Rearrangement and the induction assumption give µ(B_{k+1}) = Σ_{n=k+1}^{∞} µ(A_n), which completes the inductive step and the proof as well.
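Kac's formula lends itself to a quick numerical sanity check for the ergodic irrational rotation of Example 2.2, which preserves the arc length measure. The sketch below (the rotation number, the set A = [0, 0.1) and the sample size are arbitrary choices made for illustration) estimates the expected return time by drawing starting points from µ_A; by (6) it should be close to 1/µ(A) = 10.

```python
import numpy as np

# Empirical check of Kac's formula (6) for the rotation T_alpha, alpha irrational,
# with A = [0, 0.1): the mean first-return time of points drawn from the
# normalised measure on A should be close to 1/µ(A) = 10.
rng = np.random.default_rng(0)
alpha = np.sqrt(2) - 1
a = 0.1                                    # A = [0, a), µ(A) = a

def first_return_time(x, max_iter=10_000):
    for n in range(1, max_iter):
        x = (x + alpha) % 1.0              # one step of the rotation
        if x < a:
            return n
    return max_iter                        # not reached (should not happen here)

samples = rng.uniform(0.0, a, size=2_000)  # draw x from µ_A
mean_return = np.mean([first_return_time(x) for x in samples])
print(mean_return, 1 / a)                  # both close to 10
```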
2.3 Ergodic theorems

The following theorems of von Neumann and Birkhoff concluded the study of Boltzmann's ergodic problem (see the Introduction). The result turns out to be positive for a large class of observables. More precisely, these theorems specify the form of the convergence in (2) and the class of observables for which it holds. von Neumann's theorem also results in two characterisations of ergodicity which make it apparent that ergodicity is implied by a stronger property called mixing. The idea of the following proof is basically from [6].

Theorem 2.5 (von Neumann ergodic theorem, mean ergodic theorem). Let (X, Γ, µ, T) be a ppt. For every f ∈ L²(µ) there is an invariant f̄ ∈ L²(µ) such that

    ‖ (1/N) Σ_{n=0}^{N−1} f∘Tⁿ − f̄ ‖_{L²(µ)} → 0 as N → ∞.   (7)

Additionally, if the system is ergodic, f̄ = ∫ f dµ.

Proof. Recall that ‖f∘T‖_{L²(µ)} = ‖f‖_{L²(µ)} for all f ∈ L² by the invariance of µ. Let us define the subsets C and I of L²:

    C = {g − g∘T : g ∈ L²},   (8)
    I = {f ∈ L² : f∘T = f},   (9)

and let C̄ denote the closure of C in L². If f ∈ C, i.e. f = g − g∘T, then by telescoping

    ‖ (1/N) Σ_{n=0}^{N−1} f∘Tⁿ ‖_{L²} = (1/N) ‖ g − g∘T^N ‖_{L²} ≤ 2‖g‖_{L²}/N → 0 as N → ∞.   (10)

Thus, the claim holds for all f ∈ C with f̄ = 0. Additionally, for every f ∈ C̄ and ε > 0 there is F ∈ C such that ‖F − f‖_{L²} < ε. Pick N₀ such that N > N₀ implies ‖(1/N) Σ_{n=0}^{N−1} F∘Tⁿ‖_{L²} < ε; then for all N > N₀

    ‖ (1/N) Σ_{n=0}^{N−1} f∘Tⁿ ‖_{L²} ≤ ‖ (1/N) Σ_{n=0}^{N−1} (f − F)∘Tⁿ ‖_{L²} + ‖ (1/N) Σ_{n=0}^{N−1} F∘Tⁿ ‖_{L²}
                                    ≤ (1/N) Σ_{n=0}^{N−1} ‖ (f − F)∘Tⁿ ‖_{L²} + ε = ‖f − F‖_{L²} + ε ≤ 2ε.   (11)

This shows that the claim holds for all f ∈ C̄ (with f̄ = 0).

If f ∈ I, the claim holds trivially (with f̄ = f). If we can show that C̄^⊥ = I, this will finish the proof, since L² = C̄ ⊕ C̄^⊥ = C̄ ⊕ I. For the inclusion I ⊂ C̄^⊥, suppose that f ∈ I and that h_n = g_n − g_n∘T ∈ C is a sequence converging in L² to an element h ∈ C̄. Then

    ⟨f, h_n⟩ = ⟨f, g_n⟩ − ⟨f, g_n∘T⟩ = ⟨f, g_n⟩ − ⟨f∘T, g_n∘T⟩ = ⟨f, g_n⟩ − ⟨f, g_n⟩ = 0

by the invariance of the measure. Taking the limit, ⟨f, h⟩ = 0, which yields the claim. Conversely, if f ∈ C̄^⊥, then

    ‖f − f∘T‖²_{L²} = ‖f‖²_{L²} − 2⟨f, f∘T⟩ + ‖f∘T‖²_{L²} = 2‖f‖²_{L²} − 2⟨f, f − (f − f∘T)⟩ = 2⟨f, f − f∘T⟩ = 0,

because f − f∘T ∈ C, i.e. f ∈ I, and the first part of the theorem is proved.

The proof shows that f̄ is the orthogonal projection of f onto the space of invariant functions, which in the ergodic case consists of the constants. Additionally, since convergence in L² implies convergence in L¹ on a probability space,

    ∫ f̄ dµ = lim_{N→∞} ∫ (1/N) Σ_{n=0}^{N−1} f∘Tⁿ dµ = lim_{N→∞} ∫ f dµ = ∫ f dµ,

so that in the ergodic case the constant f̄ equals ∫ f dµ.

Remark that the proof uses only the fact that U : f ↦ f∘T is an isometry of L². In fact, the theorem holds for all contractions of general Hilbert spaces. The next theorem, Birkhoff's theorem, is probably the most celebrated result in classical ergodic theory.

Theorem 2.6 (Birkhoff's (pointwise) ergodic theorem). Let (X, Γ, µ, T) be a ppt and f ∈ L¹(X, Γ, µ). For µ-almost all x,

    (1/N) Σ_{n=0}^{N−1} f∘Tⁿ(x) → E(f | I)(x) as N → ∞.   (12)

The right-hand side is the conditional expectation of f with respect to the σ-algebra I = {A ∈ Γ : T⁻¹A = A}. In particular, if the system is ergodic, then E(f | I) is almost surely constant and equals ∫ f dµ.

Since the proof of the theorem is rather lengthy, we refer to [4, pp. 34, 37–39] in order to maintain the readability of the treatise. Remark that E(f | I) is T-invariant even when the system is not ergodic but merely measure preserving. Birkhoff's theorem leads to a very central result concerning the structure of ergodic measures.

Theorem 2.7. Distinct ergodic probability measures of the same transformation are mutually singular.

Proof. Assume that both µ and ν are ergodic and µ ≠ ν. Pick f ∈ L¹(µ) ∩ L¹(ν) such that ∫ f dµ ≠ ∫ f dν. Then, by Birkhoff's theorem, for µ-a.e. x

    (1/N) Σ_{n=0}^{N−1} f∘Tⁿ(x) → ∫ f dµ

and for ν-a.e. x

    (1/N) Σ_{n=0}^{N−1} f∘Tⁿ(x) → ∫ f dν.

Since both limits cannot hold at the same point, we deduce that there are sets A, B such that A ∩ B = ∅, µ(A) = 1 and ν(B) = 1, which, in turn, implies that the measures are mutually singular.

Assuming Theorem 2.6, it is not too difficult to prove that ergodicity is actually equivalent to the statement

    (1/N) Σ_{n=0}^{N−1} f∘Tⁿ(x) → ∫ f dµ for µ-a.e. x, for all f ∈ L¹(µ).

Additionally, using the von Neumann ergodic theorem it is immediate that the following are equivalent:

(1) T is ergodic.

(2) lim_{N→∞} (1/N) Σ_{n=0}^{N−1} ∫ f∘Tⁿ · g dµ = ∫ f dµ ∫ g dµ for all f, g ∈ L²(µ).

(3) lim_{N→∞} (1/N) Σ_{n=0}^{N−1} µ(T⁻ⁿA ∩ B) = µ(A)µ(B) for all A, B ∈ Γ.

(1) ⟹ (2) follows straightforwardly from the continuity of the scalar product of L². (2) ⟹ (3) is obtained by choosing f = χ_A, g = χ_B. (3) ⟹ (1) is obtained by applying (3) with B = A for invariant sets A.

From a purely analytical point of view, this means that ergodicity is equivalent to Cesàro convergence of µ(T⁻ⁿA ∩ B) towards µ(A)µ(B). Therefore, the classical result of Cesàro that standard convergence is stronger than Cesàro convergence, i.e.,

    lim_{n→∞} a_n = a ⟹ lim_{N→∞} (1/N) Σ_{n=0}^{N−1} a_n = a,

suggests that we could require standard convergence instead of Cesàro convergence. From a probabilistic point of view, the result says that all measurable events A, B are asymptotically independent on average. This, of course, suggests that we could require asymptotic independence itself instead of independence on average. Either way, we arrive at the concept of mixing.

2.4 Mixing

Motivated by the previous section, we begin by defining mixing.

Definition 2.3. An mpt (X, Γ, µ, T) is mixing if for every A, B ∈ Γ

    µ(T⁻ⁿA ∩ B) → µ(A)µ(B) as n → ∞.   (13)

Mixing is, of course, stronger than ergodicity by Cesàro's theorem, but we can prove this even more directly. Let A ∈ Γ be an invariant set. By the definition of mixing,

    µ(A) = µ(T⁻ⁿA ∩ A) → µ(A)µ(A) = µ(A)².

Since µ(A) = µ(A)², i.e. µ(A) ∈ {0, 1}, holds for every invariant A ∈ Γ, the system is ergodic.

It is worth noticing that, by Carathéodory's extension theorem, one does not have to check that (13) holds for every A, B ∈ Γ. Indeed, it is enough to check that it holds for every A, B in an algebra which generates the governing σ-algebra. This is quite useful, since the generating algebra is usually considerably smaller than the whole σ-algebra. Additionally, sets in the generating algebra are often the simplest ones, intervals for example. In order to give an example of mixing dynamics, we introduce the doubling map. Actually, this is just a special case of the more general results we prove in the fourth chapter.

Example 2.3. Let X be the unit circle, i.e. [0, 1] with endpoints identified. Let us show that the dynamical system (X, B, µ, T) associated with the doubling map T : X → X, T(x) = 2x mod 1, is mixing. Here B is the Borel σ-algebra and µ is the arc length measure as in the previous examples. Since dyadic intervals, i.e. intervals of the form [l/2ⁿ, (l + 1)/2ⁿ] with n ∈ N, 0 ≤ l < 2ⁿ, generate the Borel σ-algebra, it is enough to check that (13) holds for them. To this end, let A and B be dyadic intervals, say

    A = [ l/2^p, (l + 1)/2^p ]  and  B = [ k/2^q, (k + 1)/2^q ]

with p, q ∈ N and l < 2^p, k < 2^q. It is easy to check that

    T⁻¹(A) = [ l/2^{p+1}, (l + 1)/2^{p+1} ] ∪ [ l/2^{p+1} + 1/2, (l + 1)/2^{p+1} + 1/2 ].

Thus, the preimage of A consists of 2 intervals of length 1/2^{p+1} spaced by 1/2. Similarly, by induction one can show that T⁻ⁿ(A) consists of 2ⁿ intervals of length 2^{−p−n} spaced by 1/2ⁿ. Thus, if n > q, the intersection T⁻ⁿ(A) ∩ B is nonempty and it consists of those intervals of T⁻ⁿ(A) that are contained in B. Additionally, the number of intervals in T⁻ⁿ(A) ∩ B is given by the length of B divided by the spacing. Therefore, the number of intervals is

    µ(B)/(length of spacing) = (1/2^q)/(1/2ⁿ) = 2^{n−q}.

Since each interval has length 1/2^{p+n},

    µ(T⁻ⁿ(A) ∩ B) = (1/2^{p+n}) · 2^{n−q} = (1/2^p)(1/2^q) = µ(A)µ(B).

Taking the limit is trivial, since this equality holds for every n > q, and therefore the system is mixing.
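The convergence in Example 2.3 can also be observed numerically for intervals that are not dyadic, where the equality is no longer exact at finite n. The following Monte Carlo sketch (the intervals A, B and the sample size are arbitrary choices) estimates µ(T⁻ⁿA ∩ B) as the fraction of uniform points x with x ∈ B and Tⁿ(x) ∈ A, and compares it with µ(A)µ(B).

```python
import numpy as np

# Monte Carlo check of the mixing property (13) for the doubling map
# T(x) = 2x mod 1 with Lebesgue measure: estimate µ(T^{-n}A ∩ B)
# = P(x in B and T^n(x) in A) and compare with µ(A)µ(B) = 0.025.
rng = np.random.default_rng(1)
A = (0.2, 0.45)                            # arbitrary (non-dyadic) intervals
B = (0.6, 0.7)
x = rng.uniform(size=2_000_000)

def in_interval(y, I):
    return (I[0] <= y) & (y < I[1])

for n in (1, 2, 5, 10, 15):
    Tn_x = np.ldexp(x, n) % 1.0            # 2^n * x mod 1 (exact scaling by 2^n)
    est = np.mean(in_interval(x, B) & in_interval(Tn_x, A))
    print(n, round(est, 5), round((A[1] - A[0]) * (B[1] - B[0]), 5))
```

For small n the two printed numbers differ noticeably; as n grows, the estimate approaches µ(A)µ(B), as the mixing property predicts.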

Mixing can be expressed in terms of observables as well. In fact, we can define mixing through the asymptotic independence of observables.

Theorem 2.8. An mpt (X, Γ, µ, T) is mixing if and only if

    ∫ f∘Tⁿ · g dµ → ∫ f dµ ∫ g dµ as n → ∞   (14)

for all f, g ∈ L²(µ).

Proof. If χ_A and χ_B are the indicators of measurable sets A and B respectively, then

    ∫ χ_A∘Tⁿ · χ_B dµ = µ(T⁻ⁿA ∩ B) → µ(A)µ(B) = ∫ χ_A dµ ∫ χ_B dµ

by the definition of mixing. By linearity, the claim holds for simple functions as well. For positive L² functions, use the fact that simple functions are dense in the set of positive L²-functions. More precisely, for every pair of positive f, g ∈ L² and ε > 0 there are simple functions f̃ and g̃ such that ‖f − f̃‖_{L²} < ε and ‖g − g̃‖_{L²} < ε. Thus,

    | ∫ f∘Tⁿ · g dµ − ∫ f dµ ∫ g dµ |
      ≤ | ∫ (f − f̃)∘Tⁿ · g dµ | + | ∫ f̃∘Tⁿ · (g − g̃) dµ | + | ∫ f̃∘Tⁿ · g̃ dµ − ∫ f̃ dµ ∫ g̃ dµ |
        + | ∫ f̃ dµ ∫ (g̃ − g) dµ | + | ∫ (f̃ − f) dµ ∫ g dµ |
      ≤ ε‖g‖_{L²} + (‖f‖_{L²} + ε)ε + | ∫ f̃∘Tⁿ · g̃ dµ − ∫ f̃ dµ ∫ g̃ dµ | + (‖f‖_{L²} + ε)ε + ε‖g‖_{L²}
      ≤ ε( 2‖g‖_{L²} + 2‖f‖_{L²} + 2ε + 1 )

when n is large enough, since the remaining middle term tends to zero by the simple-function case; the other terms are bounded using the Cauchy–Schwarz inequality and the isometry of f ↦ f∘T. Since ε > 0 was arbitrary, this proves the claim. The validity of the theorem for general L²-functions is obtained by splitting them into positive and negative parts. The other direction is obtained easily by choosing f = χ_A and g = χ_B.

Although mixing may seem a rather artificial, albeit naturally deduced, modification of ergodicity, in hindsight one could say that mixing is the more natural concept to work with. Furthermore, mixing (contrary to ergodicity) is stringent enough to result in fascinating statistical properties. We conclude this subsection by mentioning that there is a property called weak mixing which is stronger than ergodicity but weaker than mixing. It is characterised by

    (1/N) Σ_{n=1}^{N} | µ(T⁻ⁿ(A) ∩ B) − µ(A)µ(B) | → 0 as N → ∞, for all A, B ∈ Γ.

3 Coupling in probability theory

In the previous section we set up the basic framework of ergodic theory. That is, we illustrated what interesting properties the systems might possess and what might happen if they are, for example, mixing.

Frequently, the actual task is to deduce whether or not the system actually is mixing or ergodic. We showed a few examples in which this was confirmed easily, but the more sophisticated the model, the more efficient tools one needs to establish the same results. One plausible approach is coupling, a probabilistic concept which we briefly introduce in this section and successfully use in Section 4. A more comprehensive, standard introduction to coupling is Torgny Lindvall's [7].

Suppose we are given two random variables X and Y taking values in some measurable spaces 𝒳 and 𝒴 respectively. In probability theory, a coupling of X and Y is a random variable Z = (X′, Y′) taking values in 𝒳 × 𝒴 such that the marginals X′ and Y′ have the same distributions as X and Y, respectively. We can think of Z as representing X and Y on the same probability space. A trivial coupling is obtained from the product distribution of X and Y: one constructs Z on a probability space (Ω, F, P) in such a way that P(X′ ∈ A, Y′ ∈ B) = µ_X(A) µ_Y(B) for all measurable A ⊂ 𝒳 and B ⊂ 𝒴, where µ_X is the law of X and µ_Y that of Y. In this case X′ and Y′ are independent. More interesting couplings are obtained by making X′ and Y′ dependent, but why are couplings worth studying in the first place?

Let µ and ν be probability measures on the same measurable space. The total variation distance of µ and ν is, by definition,

    d_TV(µ, ν) = sup_A | µ(A) − ν(A) |.   (15)

Additionally, if µ and ν are absolutely continuous w.r.t. a positive measure λ, the total variation distance has the equivalent expression

    d_TV(µ, ν) = (1/2) ∫ |ψ − φ| dλ,

where ψ and φ are the densities (Radon–Nikodym derivatives) w.r.t. λ. In countable measure spaces, we can choose λ to be the counting measure and obtain

    d_TV(µ, ν) = (1/2) Σ_x | µ(x) − ν(x) |.

Let X and Y be random variables with distributions µ and ν respectively. Couplings are closely related to the total variation distance by the following fact:

    d_TV(µ, ν) ≤ P(X′ ≠ Y′)   (16)

for all couplings Z = (X′, Y′). This is deduced easily, since

    µ(A) − ν(A) = P(X′ ∈ A) − P(Y′ ∈ A) ≤ P(X′ ∈ A, Y′ ∉ A) ≤ P(X′ ≠ Y′)

and

    ν(A) − µ(A) = P(Y′ ∈ A) − P(X′ ∈ A) ≤ P(Y′ ∈ A, X′ ∉ A) ≤ P(X′ ≠ Y′)

hold for every A. By using (16), the coupling inequality, with a suitable coupling, we can estimate distances and, in particular, establish interesting convergence results.
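For discrete distributions both sides of the coupling inequality (16) are easy to compute explicitly. The sketch below uses two arbitrary laws on a three point space; the identity "max over couplings of P(X′ = Y′) = Σ_x min(µ(x), ν(x))", attained by a maximal coupling, is a standard fact that we take for granted here.

```python
import numpy as np

# Discrete illustration of the coupling inequality (16): d_TV is a lower bound
# for P(X' != Y') under any coupling, and the bound is attained by a maximal
# coupling with P(X' = Y') = sum_x min(mu(x), nu(x)).
mu = np.array([0.5, 0.3, 0.2])        # two arbitrary laws on {0, 1, 2}
nu = np.array([0.25, 0.25, 0.5])

d_tv = 0.5 * np.abs(mu - nu).sum()
overlap = np.minimum(mu, nu).sum()    # max over couplings of P(X' = Y')

print("d_TV        =", d_tv)                   # 0.3
print("independent :", 1 - (mu * nu).sum())    # P(X' != Y') for the product coupling
print("maximal     :", 1 - overlap)            # equals d_TV
```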

Perhaps the most celebrated and fundamental result obtained using coupling is the exponential convergence of Markov chains towards a unique equilibrium distribution. More precisely, let (X₀, X₁, X₂, ...) be an irreducible, aperiodic Markov chain with a finite state space S = {s₁, s₂, ..., s_k}, transition matrix P and arbitrary initial distribution µ. (The chain is irreducible if for every 1 ≤ i, j ≤ k there is n such that Pⁿ_{ij} > 0, and aperiodic if gcd{n ≥ 1 : Pⁿ_{ii} > 0} = 1 for every i, gcd meaning the greatest common divisor.) Then, for any stationary distribution π, we have

    d_TV(Pⁿµ, π) → 0, that is, Pⁿµ → π in total variation.

In fact, we have the estimate

    d_TV(Pⁿµ, π) ≤ C θⁿ,   (17)

with C > 0 and θ ∈ (0, 1). We assume that the reader is familiar with the following facts:

1. In such a setting, there is at least one stationary distribution. [8, p. 26]
2. In such a setting, there is N ∈ N such that Pᴺ_{ij} > 0 for all 1 ≤ i, j ≤ k. [8, pp. 29, 31–33]

In order to prove (17), define another copy (Y_n) of the chain, independent of the original, starting from the distribution π. Then, for every n, Y_n is a random variable having distribution Pⁿπ = π. Additionally, define

    α = min{ Pᴺ_{ij} : 1 ≤ i, j ≤ k } > 0   (18)
    T = min{ n : X_n = Y_n },   (19)

with the convention that T = ∞ if the chains never meet. A crucial step in the proof is to show that, with probability 1, the two chains actually meet, that is, P(T < ∞) = 1. To this end, just compute

    P(T ≤ N) ≥ P(X_N = Y_N) ≥ P(X_N = s₁, Y_N = s₁) = P(X_N = s₁) P(Y_N = s₁)
             = ( Σ_{i=1}^{k} P(X₀ = s_i, X_N = s₁) ) ( Σ_{i=1}^{k} P(Y₀ = s_i, Y_N = s₁) )
             = ( Σ_{i=1}^{k} P(X₀ = s_i) P(X_N = s₁ | X₀ = s_i) ) ( Σ_{i=1}^{k} P(Y₀ = s_i) P(Y_N = s₁ | Y₀ = s_i) )
             ≥ ( Σ_{i=1}^{k} P(X₀ = s_i) α ) ( Σ_{i=1}^{k} P(Y₀ = s_i) α ) = α².

That is, P(T > N) ≤ 1 − α². Similarly, given everything that has happened before time N, we have conditional probability at least α² of having X_{2N} = Y_{2N} = s₁, so that

    P(X_{2N} ≠ Y_{2N} | T > N) ≤ 1 − α².

Consequently,

    P(T > 2N) = P(T > N) P(T > 2N | T > N) ≤ (1 − α²) P(T > 2N | T > N) ≤ (1 − α²) P(X_{2N} ≠ Y_{2N} | T > N) ≤ (1 − α²)².

By repeating this argument, we conclude that P(T > kN) ≤ (1 − α²)^k for every k ∈ N. Since the sequence of probabilities P(T > n) is decreasing and has a subsequence converging to zero, we deduce that

    lim_{n→∞} P(T > n) = 0.

Finally, define a coupling of X_n and Y_n by setting Y′_n = Y_n and

    X′_n = X_n if n < T,  and  X′_n = Y_n if n ≥ T.

That is, both chains evolve independently until they meet, and from then on they remain stuck together, the evolution still being governed by P. This is indeed a coupling, since Y′_n has the same distribution as Y_n trivially (they are equal), and X′_n has the same distribution as X_n since X′₀ = X₀ and the evolution is defined by the same transition matrix P. The first part of the theorem follows by noticing that

    d_TV(Pⁿµ, π) ≤ P(X′_n ≠ Y′_n) = P(T > n) → 0

by the coupling inequality. The exponential estimate (17) is also obtained, since we already know that P(T > kN) ≤ (1 − α²)^k; taking θ = (1 − α²)^{1/N} and choosing C properly yields (17). Thus, the problem was solved immediately once a suitable coupling was found. As a corollary, we also deduce that the stationary distribution π is actually unique. Indeed, if π₂ is another stationary distribution, then

    d_TV(π, π₂) = d_TV(π, Pⁿπ₂) → 0 as n → ∞.

Thus, the left-hand side, being constant, must equal 0. That is, π(s_i) = π₂(s_i) for all 1 ≤ i ≤ k.
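The coupling argument above can be simulated directly. In the following sketch the transition matrix is an arbitrary irreducible, aperiodic example; two independent copies of the chain are run (one from a point mass µ, one from π) until they meet, and the empirical tail P(T > n) is printed next to d_TV(Pⁿµ, π), which by the coupling inequality it dominates.

```python
import numpy as np

# Simulation sketch for the coupling argument: two independent copies of a small
# chain are run until they meet; the tail P(T > n) of the meeting time is
# compared with d_TV(P^n mu, pi).
rng = np.random.default_rng(2)
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.3, 0.3, 0.4]])
mu = np.array([1.0, 0.0, 0.0])                  # X starts at state 0, Y starts from pi

# stationary distribution pi (left eigenvector of P for eigenvalue 1)
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()

def meeting_time(n_max=200):
    x = 0
    y = rng.choice(3, p=pi)
    if x == y:
        return 0
    for n in range(1, n_max):
        x = rng.choice(3, p=P[x])               # the two chains move independently
        y = rng.choice(3, p=P[y])
        if x == y:
            return n
    return n_max

T = np.array([meeting_time() for _ in range(10_000)])
for n in (1, 2, 4, 8, 12):
    dist_n = mu @ np.linalg.matrix_power(P, n)  # distribution of X_n
    d_tv = 0.5 * np.abs(dist_n - pi).sum()
    print(n, round(d_tv, 4), round(np.mean(T > n), 4))   # d_TV <= P(T > n)
```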
4 Uniformly expanding maps on the circle

We assume that T : S¹ → S¹ is a smooth, uniformly expanding map on the circle. More precisely, let S¹ denote the interval [0, 1] with its endpoints identified, let T be C², and assume there exists λ > 1 such that

    |T′(x)| ≥ λ for all x ∈ S¹.   (20)

Our goal is to prove that

(1) there exists an absolutely continuous invariant measure (a.c.i.m.), the natural choice for the governing σ-algebra being the Borel σ-algebra B, and that

(2) the invariant measure is mixing.

In fact, we want to prove stronger statements than these and obtain quantitative information about the invariant measure. In addition to (1) and (2), we are going to show that the invariant density (the density of the invariant measure) is actually Lipschitz continuous, and that the system is exponentially mixing in a suitable class of observables, namely the class of Hölder continuous functions of arbitrary order.

4.1 Preliminaries

We begin by introducing some basic facts that will be used throughout the text. The expansion property below is a manifestation of the hyperbolic nature of the dynamics, and inverse branches are needed to treat the transfer operator effectively. The transfer operator itself will be a crucial tool in the analysis of the dynamics.

4.1.1 Expansion properties and inverse branches

First, since T is C² and (20) holds, T′ cannot change sign. Therefore, we can and do assume that T′ is strictly positive:

    T′(x) ≥ λ for all x ∈ S¹.   (21)

We denote the standard metric on S¹ by d(·,·), i.e., d(x, y) = min{|x − y|, 1 − |x − y|}, and arc lengths by |·|. Observe that, given an arc I ⊂ S¹ such that T(I) ≠ S¹, it follows from the expanding property (21) that T(I) is an arc with |T(I)| ≥ λ|I|. Notice that the quantity

    w = ∫_{[0,1]} T′(x) dx

is a positive integer. We present the following, rather obvious, result without proof.

Lemma 4.1. Let J ⊊ S¹ be an arc. Then the map T has exactly w well-defined inverse branches on J. More precisely, there exist arcs I_i ⊂ S¹ and maps T_i⁻¹ : J → I_i, 1 ≤ i ≤ w, such that each restriction T|I_i is one-to-one and onto J (in fact a C² diffeomorphism in the interior of I_i) and T_i⁻¹ is its inverse.

By the earlier observation, |J| = |T(I_i)| ≥ λ|I_i|. Therefore, Lemma 4.1 can be applied repeatedly, resulting in exactly wⁿ well-defined inverse branches of Tⁿ on J. Let us denote them (Tⁿ)_i⁻¹ : J → I_iⁿ, 1 ≤ i ≤ wⁿ, where each I_iⁿ is an arc. Iterating the previous bound, |I_iⁿ| ≤ λ⁻ⁿ|J|. This results in the lemma below.

Lemma 4.2. Let x, y ∈ S¹ be arbitrary. Suppose J ⊂ S¹ is an arc with |J| ≤ 1/2 and x, y ∈ J. Then

    d( (Tⁿ)_i⁻¹(x), (Tⁿ)_i⁻¹(y) ) ≤ λ⁻ⁿ d(x, y)

for all 1 ≤ i ≤ wⁿ and all n ≥ 1. (Here (Tⁿ)_i⁻¹ are the inverse branches of Tⁿ on J.)

Proof. Consider the arc J′ ⊂ J with endpoints x and y. It is a subset of a semicircle in S¹, so that d(x, y) = |J′|. Hence,

    d( (Tⁿ)_i⁻¹(x), (Tⁿ)_i⁻¹(y) ) ≤ |(Tⁿ)_i⁻¹(J′)| ≤ λ⁻ⁿ|J′| = λ⁻ⁿ d(x, y)

by the observation preceding the lemma.

4.1.2 Transfer operator

The transfer operator L_T : L²(m) → L²(m) is defined by the duality equation

    ∫ f∘T · g dm = ∫ f · L_T g dm for all f, g ∈ L²(m).   (22)

In particular, if ψ is the density of a probability measure ν, then Lψ is the density of the push-forward measure T_*ν = ν∘T⁻¹. In what follows, the T-subscript is omitted when there is no risk of confusion, and L denotes the transfer operator associated with T. One can easily confirm that Lu has the expression

    Lu(x) = Σ_{y ∈ T⁻¹{x}} u(y)/T′(y).   (23)

The missing details in the proof of the following lemma, which summarises basic properties of L, are left to the reader.

Lemma 4.3. For the transfer operator related to the expanding mapping T,

1. |Lu| ≤ L|u| for all u : S¹ → R,
2. 0 ≤ u ≤ v ⟹ 0 ≤ Lu ≤ Lv for all u, v : S¹ → R,
3. ∫ Lu dm = ∫ u dm for all u ∈ L¹(m), and
4. ‖Lu‖_{L¹(m)} ≤ ‖u‖_{L¹(m)} for all u ∈ L¹(m).

Proof. 1. Use (23) and the triangle inequality. 2. Use (23) and the linearity of L. 3. By equation (22),

    ∫ Lu dm = ∫ 1 · Lu dm = ∫ 1∘T · u dm = ∫ u dm.

4. Combine properties 1 and 3.

Using the inverse branches of T, one can also deduce the expressions

    Lu(x) = Σ_{i=1}^{w} u(T_i⁻¹(x)) / T′(T_i⁻¹(x))   (24)

    (Lu)′(x) = Σ_{i=1}^{w} ( u′(T_i⁻¹(x)) / T′(T_i⁻¹(x))² − u(T_i⁻¹(x)) T″(T_i⁻¹(x)) / T′(T_i⁻¹(x))³ )   (25)

in case u ∈ C¹.
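As a concrete illustration of formula (24), the sketch below implements the transfer operator of one specific degree-two expanding map, T(x) = 2x + ε sin(2πx) mod 1 with a small ε (so that T′ ≥ 2 − 2πε > 1); this particular map is an assumption made only for the example. The inverse branches are found by bisection, and property 3 of Lemma 4.3 is checked numerically.

```python
import numpy as np

# Transfer operator (24) for the illustrative map T(x) = 2x + eps*sin(2*pi*x) mod 1.
# We evaluate Lu on a grid via the two inverse branches and verify ∫ Lu dm = ∫ u dm.
eps = 0.05
T_lift  = lambda y: 2 * y + eps * np.sin(2 * np.pi * y)      # lift of T to [0, 2]
T_prime = lambda y: 2 + 2 * np.pi * eps * np.cos(2 * np.pi * y)

def inverse_branch(x, i, tol=1e-12):
    """Solve T_lift(y) = x + i for y in [0, 1] by bisection (T_lift is increasing)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if T_lift(mid) < x + i:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def L(u, x):
    # formula (24): Lu(x) = sum over inverse branches of u(y)/T'(y)
    return sum(u(y) / T_prime(y) for i in (0, 1) for y in [inverse_branch(x, i)])

u = lambda y: 1 + 0.3 * np.cos(2 * np.pi * y)                # a test density
grid = np.linspace(0.0, 1.0, 2001)
Lu_vals = np.array([L(u, x) for x in grid])
print(np.trapz(u(grid), grid), np.trapz(Lu_vals, grid))      # both ≈ 1
```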

4.2 Existence of a.c.i.m.

The existence of the a.c.i.m. is proved by a compactness argument similar to the Arzelà–Ascoli theorem. More precisely, we are about to prove that the C¹-norms of the push-forward densities remain uniformly bounded, and that bounded subsets of C¹ are compact in the space of Lipschitz functions endowed with its natural norm. Actually, with minor and obvious modifications our argument also proves the Arzelà–Ascoli theorem, one of the great results of classical analysis. It is worth mentioning that the strategy we use — establishing invariance by taking averages along trajectories and using compactness to deduce that a limit exists — is a widely used method for constructing invariant measures.

4.2.1 Boundedness

In this subsection, we derive estimates for the transfer operator in the spirit of C. Liverani [9]. The ultimate result we wish to prove is that for every fixed ψ ∈ C¹ there is L > 0 such that

    sup_{n∈N} ‖Lⁿψ‖_{C¹} = sup_{n∈N} ( ‖Lⁿψ‖_∞ + ‖(Lⁿψ)′‖_∞ ) ≤ L.   (26)

We begin by pointing out that the frequently used pointwise estimate

    |(Lψ)′| ≤ (1/λ) L|ψ′| + (‖T″‖_∞/λ²) L|ψ|   (27)

follows directly from (25). Using this estimate and the fact that L is a contraction on L¹(m) (property 4 in Lemma 4.3), we can control the L¹-norms of the push-forwards. Let us prove that these are uniformly bounded with respect to n.

Lemma 4.4. Let ψ ∈ C¹ be arbitrary. For every n ∈ N we have the upper bound

    ‖(Lⁿψ)′‖_{L¹(m)} ≤ λ⁻ⁿ ‖ψ′‖_{L¹(m)} + ( ‖T″‖_∞ / (λ(λ − 1)) ) ‖ψ‖_{L¹(m)}.   (28)

Proof. The case n = 1 is a direct consequence of the contracting nature of L and (27):

    ‖(Lψ)′‖_{L¹(m)} ≤ (1/λ) ‖L|ψ′|‖_{L¹(m)} + (‖T″‖_∞/λ²) ‖L|ψ|‖_{L¹(m)} ≤ (1/λ) ‖ψ′‖_{L¹(m)} + ( ‖T″‖_∞/(λ(λ − 1)) ) ‖ψ‖_{L¹(m)}.

Next we claim that

    ‖(Lⁿψ)′‖_{L¹(m)} ≤ λ⁻ⁿ ‖ψ′‖_{L¹(m)} + ‖T″‖_∞ ‖ψ‖_{L¹(m)} Σ_{i=2}^{n+1} λ⁻ⁱ   (29)

holds for any n ∈ N. To this end, we proceed by induction. Suppose (29) holds for n = k. For n = k + 1 we have

    ‖(L^{k+1}ψ)′‖_{L¹(m)} ≤ (1/λ) ‖L|(L^kψ)′|‖_{L¹(m)} + (‖T″‖_∞/λ²) ‖L|L^kψ|‖_{L¹(m)}
                         ≤ (1/λ) ‖(L^kψ)′‖_{L¹(m)} + (‖T″‖_∞/λ²) ‖ψ‖_{L¹(m)}
                         ≤ (1/λ) ( λ⁻ᵏ ‖ψ′‖_{L¹(m)} + ‖T″‖_∞ ‖ψ‖_{L¹(m)} Σ_{i=2}^{k+1} λ⁻ⁱ ) + (‖T″‖_∞/λ²) ‖ψ‖_{L¹(m)}
                         = λ^{−(k+1)} ‖ψ′‖_{L¹(m)} + ‖T″‖_∞ ‖ψ‖_{L¹(m)} Σ_{i=2}^{k+2} λ⁻ⁱ.

Thus, we conclude that (29) holds for any n ∈ N. Finally, we observe that

    Σ_{i=2}^{n+1} λ⁻ⁱ ≤ 1/(λ(λ − 1)) for all n ∈ N,

which finishes the proof.

The constant ‖T″‖_∞/(λ(λ − 1)) appearing on the right-hand side of (28) is used so extensively throughout this study that we denote it by the symbol Ω, i.e.,

    Ω := ‖T″‖_∞ / (λ(λ − 1)).   (30)

The constant Ω in a way measures the regularity of the dynamics, that is, the linearity of the dynamics compared to the minimum magnitude of stretching.

Remark 4.1. The contribution of the derivative of the initial function ψ to the upper bound in (28) becomes negligible after a sufficiently long time. This is a very general property of expanding dynamics: in the long run, sufficiently regular functions become even more smoothly distributed.

The reason to examine L¹-norms is that in the case of C¹ probability densities we can bound sup-norms by L¹-norms. More precisely, if ψ is a differentiable density, there is at least one point x₀ at which ψ(x₀) = 1; otherwise it would not be possible to have ∫ ψ dm = 1. Thus,

    ψ(x) = ∫_{x₀}^{x} ψ′(t) dt + ψ(x₀) ≤ ∫_{x₀}^{x} |ψ′(t)| dt + 1 ≤ ‖ψ′‖_{L¹(m)} + 1

holds for every x ∈ S¹, and therefore

    ‖ψ‖_∞ ≤ ‖ψ′‖_{L¹(m)} + 1.   (31)

Remark that ‖Lⁿv‖_∞ ≤ ‖v‖_∞ ‖Lⁿ1‖_∞ holds by the first two parts of Lemma 4.3, where 1 = χ_{S¹} ∈ C¹. We now have the tools to establish the desired boundedness results.

Theorem 4.1. For every bounded ψ,

    sup_{n∈N} ‖Lⁿψ‖_∞ ≤ (1 + Ω) ‖ψ‖_∞.

Theorem 4.2. For every continuously differentiable ψ,

    sup_{n∈N} ‖Lⁿψ‖_{C¹} ≤ (1 + Ω)² ‖ψ‖_{C¹}.

Proof of Theorem 4.1. Notice that Lⁿ1 is a C¹ probability density for every n ∈ N. Therefore, estimates (28) and (31) yield

    ‖Lⁿψ‖_∞ ≤ ‖ψ‖_∞ ‖Lⁿ1‖_∞ ≤ ‖ψ‖_∞ ( ‖(Lⁿ1)′‖_{L¹(m)} + 1 ) ≤ ‖ψ‖_∞ ( ‖T″‖_∞/(λ(λ − 1)) + 1 ) = (1 + Ω) ‖ψ‖_∞.

Proof of Theorem 4.2. In order to achieve a uniform bound, we first claim that, pointwise on S¹,

    |(Lⁿψ)′| ≤ (1/λⁿ) Lⁿ|ψ′| + ‖T″‖_∞ Σ_{i=2}^{n+1} L^{n+2−i}|ψ| / λⁱ.

By estimate (27) this holds when n = 1. For n > 1 we proceed by induction. Suppose the inequality holds for n = k. For n = k + 1 it holds that

    |(L^{k+1}ψ)′| ≤ (1/λ) L|(L^kψ)′| + (‖T″‖_∞/λ²) L^{k+1}|ψ|
                 ≤ (1/λ) L( (1/λᵏ) Lᵏ|ψ′| + ‖T″‖_∞ Σ_{i=2}^{k+1} L^{k+2−i}|ψ|/λⁱ ) + (‖T″‖_∞/λ²) L^{k+1}|ψ|
                 = (1/λ^{k+1}) L^{k+1}|ψ′| + ‖T″‖_∞ Σ_{i=2}^{k+1} L^{k+3−i}|ψ|/λ^{i+1} + (‖T″‖_∞/λ²) L^{k+1}|ψ|
                 = (1/λ^{k+1}) L^{k+1}|ψ′| + ‖T″‖_∞ Σ_{i=2}^{k+2} L^{k+3−i}|ψ|/λⁱ.

Therefore, by the induction principle, the estimate holds for any n ∈ N. From this estimate, together with Theorem 4.1, it is easy to deduce the bound

    ‖(Lⁿψ)′‖_∞ ≤ (1 + Ω)( ‖ψ′‖_∞ + Ω ‖ψ‖_∞ ).   (32)

What we have proved is that

    ‖Lⁿψ‖_{C¹} ≤ (1 + Ω)² ‖ψ‖_∞ + (1 + Ω) ‖ψ′‖_∞ ≤ (1 + Ω)² ‖ψ‖_∞ + (1 + Ω)² ‖ψ′‖_∞ = (1 + Ω)² ‖ψ‖_{C¹}   (33)

for all natural numbers n and continuously differentiable ψ.

4.2.2 Compactness

We begin by proving the compactness result that guarantees the existence of a Lipschitz continuous invariant density.

Lemma 4.5. Suppose that a sequence of C¹ functions f_n : S¹ → R is bounded, i.e., sup_n ‖f_n‖_{C¹} ≤ L < ∞. Then it has a subsequence (f_{n_k})_k which converges uniformly to a Lipschitz function f having Lipschitz constant Lip(f) ≤ L.

Proof. Let {q₁, q₂, ...} be the set of rational points of S¹. Since sup_n ‖f_n‖_∞ < ∞, the closure of {f_n(q₁) : n ∈ N} is a compact subset of R. Thus, we have a subsequence (n¹_j)_j such that f_{n¹_j}(q₁) → y₁ ∈ R. Similarly, by the compactness of the closure of {f_{n¹_j}(q₂) : j ∈ N}, there is a subsequence (n²_j)_j of (n¹_j)_j such that f_{n²_j}(q₂) → y₂ ∈ R. Continuing inductively, we obtain subsequences (nᵏ_j)_j such that (nᵏ_j)_j is a subsequence of (n^{k−1}_j)_j and f_{nᵏ_j}(q_k) → y_k as j tends to infinity. By defining the diagonal sequence m_k = nᵏ_k, we obtain a subsequence of all the inductively defined sequences above (up to finitely many initial terms). Therefore, it satisfies f_{m_k}(q_j) → y_j as k → ∞.

The densely convergent diagonal sequence (f_{m_k})_k is from now on denoted by (f_n)_n to simplify notation. In order to achieve convergence everywhere, let ε > 0. Since (f_n(q))_n converges at every rational point q, for every rational q there is N_q such that

    n, m > N_q ⟹ |f_n(q) − f_m(q)| < ε/3.

Since M := sup_n ‖f_n‖_{C¹} < ∞, we also have

    |x − y| < ε/(3M) ⟹ |f_n(x) − f_n(y)| < ε/3 for all n ∈ N.

Now, we can pick a rational q with |x − q| < ε/(3M) and estimate

    |f_n(x) − f_m(x)| ≤ |f_n(x) − f_n(q)| + |f_n(q) − f_m(q)| + |f_m(q) − f_m(x)| < ε

when n, m > N_q. Thus, (f_n)_n is pointwise convergent by the completeness of R. To achieve uniform convergence, we partition the circle into finitely many intervals I_j of length less than ε/(3M), so that

    x, y ∈ I_j ⟹ |x − y| < ε/(3M) ⟹ |f_n(x) − f_n(y)| < ε/3.

Pick a rational q_j from each I_j and define N = max_j N_{q_j}, which is finite by the finiteness of the set {q_j}. By the definitions of the partition and of N, the following holds for every x when n, m > N:

    |f_n(x) − f_m(x)| ≤ |f_n(x) − f_n(q)| + |f_n(q) − f_m(q)| + |f_m(q) − f_m(x)| < ε/3 + ε/3 + ε/3 = ε,

where q is the selected rational point of the interval in which x lies. Thus, the sequence is uniformly Cauchy, hence uniformly convergent.

Additionally, each member of the sequence is Lipschitz continuous with Lipschitz constant at most sup_n ‖f_n‖_{C¹} ≤ L. Thus, the last thing to prove is that a uniform limit of Lipschitz functions with Lipschitz constants at most L is also Lipschitz with constant at most L. To this end, let ε > 0. Due to uniform convergence, there is n₀ ∈ N such that n > n₀ implies |f_n(x) − f(x)| < ε for every x. Then we have the estimate

    |f(x) − f(y)| ≤ |f(x) − f_n(x)| + |f_n(x) − f_n(y)| + |f_n(y) − f(y)| ≤ L|x − y| + 2ε

for all x, y. Since ε > 0 is arbitrary, we deduce that f is also Lipschitz with its Lipschitz constant bounded by L.

Remark 4.2. It is worth noticing that the crucial properties of C¹-bounded sequences we used are the very same as the ones in the Arzelà–Ascoli theorem. That is, the sets {f_n(x) : n ∈ N} are bounded and {f_n : n ∈ N} is equicontinuous. The last result can be generalised a little and re-interpreted as follows: closed, bounded subsets of C¹ are compact in the space of Lipschitz continuous functions endowed with its natural norm ‖·‖ = ‖·‖_∞ + Lip(·).

With the aid of this lemma we are ready to prove the main theorem of the chapter.

Theorem 4.3 (Existence of a Lipschitz continuous invariant density). There is an invariant measure µ which is absolutely continuous w.r.t. the Lebesgue measure and whose density is Lipschitz continuous, with a Lipschitz constant bounded by a quantity depending only on the dynamics of the system. More precisely, the density φ = dµ/dm satisfies Lip(φ) ≤ (1 + Ω)².

Proof. Let ψ be a C¹ density associated with an absolutely continuous, not necessarily invariant, probability measure ν. By Theorem 4.2, we know that the densities Lⁿψ are contained in some closed ball of C¹. Now, define a sequence of probability measures µ_N by the densities

    ψ_N = (1/N) Σ_{n=0}^{N−1} Lⁿψ.   (34)

Since the averages are contained in the same closed ball, Lemma 4.5 gives a subsequence (ψ_{N_j})_j such that ψ_{N_j} converges uniformly to a Lipschitz density φ. Let µ be the probability measure defined by the limit density φ. In order to show that it is invariant, let A be an arbitrary measurable set. By the bounded convergence theorem and the observation that

    ∫_{T⁻¹A} ψ_{N_j} dm = ∫_{S¹} χ_A∘T · ψ_{N_j} dm = ∫_{S¹} χ_A · Lψ_{N_j} dm = ∫_A Lψ_{N_j} dm,

we have the equality

    µ(T⁻¹A) − µ(A) = ∫_{T⁻¹A} lim_j ψ_{N_j} dm − ∫_A lim_j ψ_{N_j} dm
                   = lim_j ∫_{T⁻¹A} ψ_{N_j} dm − lim_j ∫_A ψ_{N_j} dm
                   = lim_j ∫_A (Lψ_{N_j} − ψ_{N_j}) dm.

Now the proof of invariance is accomplished by the estimate

    | ∫_A (Lψ_{N_j} − ψ_{N_j}) dm | = | ∫_A (L^{N_j}ψ − ψ)/N_j dm | ≤ (1/N_j)( ‖L^{N_j}ψ‖_{L¹} + ‖ψ‖_{L¹} ) ≤ 2‖ψ‖_{L¹}/N_j → 0 as j → ∞.

An upper bound for the Lipschitz constant is obtained from the estimates on the Lipschitz constants of the ψ_{N_j}:

    Lip(φ) ≤ sup_j Lip(ψ_{N_j}) ≤ sup_j ‖ψ_{N_j}‖_{C¹} ≤ (1 + Ω)² ‖ψ‖_{C¹}.

Additionally, since ψ was arbitrary, we may pick ψ ≡ 1 to obtain

    Lip(φ) ≤ (1 + Ω)².   (35)

Remark that this is only an existence theorem, not a uniqueness theorem; we have shown that every C¹ density ψ gives rise to at least one invariant density. As in the case of Markov chains in Section 3, uniqueness will follow from the convergence of an arbitrary density towards any invariant density.
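The construction used in the proof of Theorem 4.3 can be imitated on a grid. The sketch below (same illustrative map as in the earlier transfer operator sketch; the grid size and the values of N are arbitrary) forms the averages ψ_N of (34) starting from ψ ≡ 1 and prints the invariance defect ‖Lψ_N − ψ_N‖_∞, which should shrink as N grows, in line with the 2‖ψ‖_{L¹}/N bound appearing in the proof.

```python
import numpy as np

# Grid-based sketch of the construction (34) for the illustrative expanding map
# T(x) = 2x + eps*sin(2*pi*x) mod 1 (an assumed example, not the general T of
# the text): form psi_N = (1/N) sum_{n<N} L^n psi and watch the invariance
# defect ||L psi_N - psi_N|| shrink.
eps = 0.05
T_lift  = lambda y: 2 * y + eps * np.sin(2 * np.pi * y)      # lift of T to [0, 2]
T_prime = lambda y: 2 + 2 * np.pi * eps * np.cos(2 * np.pi * y)
grid = np.linspace(0.0, 1.0, 2001)

def inverse_branch(x, i):
    lo, hi = np.zeros_like(x), np.ones_like(x)
    for _ in range(50):                         # vectorised bisection on [0, 1]
        mid = 0.5 * (lo + hi)
        below = T_lift(mid) < x + i
        lo, hi = np.where(below, mid, lo), np.where(below, hi, mid)
    return 0.5 * (lo + hi)

branches = [inverse_branch(grid, i) for i in (0, 1)]

def L(psi_vals):
    # transfer operator (24) on the grid: sum over the two inverse branches
    return sum(np.interp(y, grid, psi_vals) / T_prime(y) for y in branches)

iterates = [np.ones_like(grid)]                 # psi = 1, the constant density
for N in (5, 20, 80):
    while len(iterates) < N:
        iterates.append(L(iterates[-1]))
    psi_N = np.mean(iterates[:N], axis=0)       # the Cesàro average (34)
    print(N, round(np.max(np.abs(L(psi_N) - psi_N)), 5))
```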

4.3 Exponential decay of correlation coefficients

Recall that the mixing property can be equivalently stated in terms of the decay of correlation coefficients for all L²-functions:

    lim_{n→∞} ∫ f∘Tⁿ · g dµ = ∫ f dµ ∫ g dµ for all f, g ∈ L²(µ).   (36)

In the previous section, we proved that µ is absolutely continuous with respect to the Lebesgue measure. Denoting the density by φ, mixing is equivalent to

    lim_{n→∞} ∫ f∘Tⁿ · g φ dm = ∫ fφ dm ∫ gφ dm for all f, g ∈ L²(µ).

As motivated at the beginning of this chapter, we are willing to restrict ourselves to Hölder continuous observables in order to obtain concrete estimates on the error terms

    | ∫ f∘Tⁿ · gφ dm − ∫ fφ dm ∫ gφ dm |.

To this end, remark that

    ψ = φ (g + 2‖g‖_∞) / ( ∫ g dµ + 2‖g‖_∞ )

is a probability density (it is nonnegative and integrates to 1), and that, by the invariance of µ and the duality (22),

    ∫ f∘Tⁿ · gφ dm − ∫ fφ dm ∫ gφ dm = ( ∫ g dµ + 2‖g‖_∞ ) ∫ f (Lⁿψ − φ) dm.

In this way we can reduce the decay of correlation coefficients to the L¹(m) convergence of probability densities towards the invariant density:

    | ∫ f∘Tⁿ · gφ dm − ∫ fφ dm ∫ gφ dm | ≤ 3 ‖g‖_∞ ‖f‖_∞ ‖Lⁿψ − φ‖_{L¹(m)}.

Therefore, it is sufficient to prove the inequality

    ‖Lⁿψ − φ‖_{L¹(m)} ≤ D e^{−cn} for f, g ∈ C^α,   (37)

with D, c > 0, in order to obtain exponentially decaying estimates. This will be proved after we have developed the tools needed to construct the coupling argument. In what follows, we use the notation

    H_α(f) = sup_{x≠y} |f(x) − f(y)| / d(x, y)^α

for the Hölder coefficient of a function f, and denote the class of Hölder continuous functions of order α by C^α. That is,

    C^α = { f : S¹ → R : H_α(f) < ∞ }.
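For the doubling map of Example 2.3 the reduction above is particularly transparent: the invariant density is φ ≡ 1 and the transfer operator has the explicit form Lu(x) = ½(u(x/2) + u((x + 1)/2)), so by (22) the correlation equals ∫ f · Lⁿg dm − ∫ f dm ∫ g dm. The sketch below (observables chosen arbitrarily, one of them Hölder of order 1/2) iterates L on a grid and prints the correlations, whose magnitudes shrink geometrically.

```python
import numpy as np

# Exponential decay of correlations, illustrated for the doubling map
# T(x) = 2x mod 1: the correlation ∫ f∘T^n g dm − ∫ f dm ∫ g dm is computed as
# ∫ f · L^n g dm − ∫ f dm ∫ g dm with the explicit transfer operator of T.
grid = np.linspace(0.0, 1.0, 4001)
f = np.cos(2 * np.pi * grid) + 0.5 * np.sin(4 * np.pi * grid)
g = np.abs(grid - 0.5) ** 0.5                 # a Hölder (order 1/2) observable

def L(u):
    return 0.5 * (np.interp(grid / 2, grid, u) +
                  np.interp((grid + 1) / 2, grid, u))

mean = lambda u: np.trapz(u, grid)
Lng = g.copy()
for n in range(1, 9):
    Lng = L(Lng)
    corr = mean(f * Lng) - mean(f) * mean(g)
    print(n, f"{corr:+.2e}")                  # magnitudes shrink geometrically
```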

4.3.1 Preliminary results

The following distortion bound is central for understanding the structure of the push-forward densities Lⁿψ. The core idea of the proof is basically the same as Lai-Sang Young's in [10].

Lemma 4.6. Let n ∈ N be arbitrary. For any x, y ∈ S¹,

    e^{−Ω d(x,y)} ≤ (Tⁿ)′((Tⁿ)_i⁻¹ x) / (Tⁿ)′((Tⁿ)_i⁻¹ y) ≤ e^{Ω d(x,y)},   (38)

in which Ω = ‖T″‖_∞/(λ(λ − 1)) is independent of n. Here (Tⁿ)_i⁻¹ is the i-th branch of the inverse of Tⁿ on a given arc J ⊂ S¹ of length |J| ≤ 1/2 containing both x and y.

Proof of Lemma 4.6. Let J be as in the statement of the lemma. Lemma 4.2 implies that, for an arbitrary k ≥ 1, the inverse branches of Tᵏ on J are well defined. For brevity, let x_k and y_k denote the preimages of x and y, respectively, along the same branch, so that by the chain rule (Tⁿ)′(x_n) = Π_{k=1}^{n} T′(x_k). Since the logarithm is λ⁻¹-Lipschitz on [λ, ‖T′‖_∞], we can estimate

    | log( (Tⁿ)′(x_n) / (Tⁿ)′(y_n) ) | = | log (Tⁿ)′(x_n) − log (Tⁿ)′(y_n) |
        ≤ Σ_{k=1}^{n} | log T′(x_k) − log T′(y_k) |
        ≤ (1/λ) ‖T″‖_∞ Σ_{k=1}^{n} d(x_k, y_k)
        ≤ (1/λ) ‖T″‖_∞ Σ_{k=1}^{n} λ⁻ᵏ d(x, y)
        ≤ ( ‖T″‖_∞ / (λ(λ − 1)) ) d(x, y) = Ω d(x, y),

where the third line uses |T′(x_k) − T′(y_k)| ≤ ‖T″‖_∞ d(x_k, y_k) and the fourth uses Lemma 4.2. A similar estimate is obtained by interchanging x and y, and exponentiating proves the claim.

This lemma is immediately used in the following theorem, which is the cornerstone of our coupling argument. The trick is that sometimes, particularly in our case, it is more convenient to work with logarithms of functions instead of the functions themselves.

Theorem 4.4. Suppose ψ is a strictly positive probability density and that log ψ ∈ C^α. Then Lⁿψ has the same properties for every n ∈ N and

    H_α(log Lⁿψ) ≤ H_α(log ψ) / λ^{αn} + Ω.

Proof. Let J ⊂ S¹ be an arc with |J| ≤ 1/2. Given an initial probability density ψ, we introduce the notation

    ψ_{n,i}(x) = ψ((Tⁿ)_i⁻¹ x) / (Tⁿ)′((Tⁿ)_i⁻¹ x),  x ∈ J,


More information

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

More information

MEASURE AND INTEGRATION. Dietmar A. Salamon ETH Zürich

MEASURE AND INTEGRATION. Dietmar A. Salamon ETH Zürich MEASURE AND INTEGRATION Dietmar A. Salamon ETH Zürich 12 May 2016 ii Preface This book is based on notes for the lecture course Measure and Integration held at ETH Zürich in the spring semester 2014. Prerequisites

More information

Some stability results of parameter identification in a jump diffusion model

Some stability results of parameter identification in a jump diffusion model Some stability results of parameter identification in a jump diffusion model D. Düvelmeyer Technische Universität Chemnitz, Fakultät für Mathematik, 09107 Chemnitz, Germany Abstract In this paper we discuss

More information

Fixed Point Theorems

Fixed Point Theorems Fixed Point Theorems Definition: Let X be a set and let T : X X be a function that maps X into itself. (Such a function is often called an operator, a transformation, or a transform on X, and the notation

More information

An example of a computable

An example of a computable An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

Let H and J be as in the above lemma. The result of the lemma shows that the integral

Let H and J be as in the above lemma. The result of the lemma shows that the integral Let and be as in the above lemma. The result of the lemma shows that the integral ( f(x, y)dy) dx is well defined; we denote it by f(x, y)dydx. By symmetry, also the integral ( f(x, y)dx) dy is well defined;

More information

MATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform

MATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform MATH 433/533, Fourier Analysis Section 11, The Discrete Fourier Transform Now, instead of considering functions defined on a continuous domain, like the interval [, 1) or the whole real line R, we wish

More information

Metric Spaces. Chapter 1

Metric Spaces. Chapter 1 Chapter 1 Metric Spaces Many of the arguments you have seen in several variable calculus are almost identical to the corresponding arguments in one variable calculus, especially arguments concerning convergence

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

9 More on differentiation

9 More on differentiation Tel Aviv University, 2013 Measure and category 75 9 More on differentiation 9a Finite Taylor expansion............... 75 9b Continuous and nowhere differentiable..... 78 9c Differentiable and nowhere monotone......

More information

THE BANACH CONTRACTION PRINCIPLE. Contents

THE BANACH CONTRACTION PRINCIPLE. Contents THE BANACH CONTRACTION PRINCIPLE ALEX PONIECKI Abstract. This paper will study contractions of metric spaces. To do this, we will mainly use tools from topology. We will give some examples of contractions,

More information

1 The Brownian bridge construction

1 The Brownian bridge construction The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

More information

4. Expanding dynamical systems

4. Expanding dynamical systems 4.1. Metric definition. 4. Expanding dynamical systems Definition 4.1. Let X be a compact metric space. A map f : X X is said to be expanding if there exist ɛ > 0 and L > 1 such that d(f(x), f(y)) Ld(x,

More information

CONTINUED FRACTIONS AND PELL S EQUATION. Contents 1. Continued Fractions 1 2. Solution to Pell s Equation 9 References 12

CONTINUED FRACTIONS AND PELL S EQUATION. Contents 1. Continued Fractions 1 2. Solution to Pell s Equation 9 References 12 CONTINUED FRACTIONS AND PELL S EQUATION SEUNG HYUN YANG Abstract. In this REU paper, I will use some important characteristics of continued fractions to give the complete set of solutions to Pell s equation.

More information

8 Divisibility and prime numbers

8 Divisibility and prime numbers 8 Divisibility and prime numbers 8.1 Divisibility In this short section we extend the concept of a multiple from the natural numbers to the integers. We also summarize several other terms that express

More information

Introduction to Topology

Introduction to Topology Introduction to Topology Tomoo Matsumura November 30, 2010 Contents 1 Topological spaces 3 1.1 Basis of a Topology......................................... 3 1.2 Comparing Topologies.......................................

More information

Inner Product Spaces

Inner Product Spaces Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

More information

Practice with Proofs

Practice with Proofs Practice with Proofs October 6, 2014 Recall the following Definition 0.1. A function f is increasing if for every x, y in the domain of f, x < y = f(x) < f(y) 1. Prove that h(x) = x 3 is increasing, using

More information

Mathematics for Econometrics, Fourth Edition

Mathematics for Econometrics, Fourth Edition Mathematics for Econometrics, Fourth Edition Phoebus J. Dhrymes 1 July 2012 1 c Phoebus J. Dhrymes, 2012. Preliminary material; not to be cited or disseminated without the author s permission. 2 Contents

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

x a x 2 (1 + x 2 ) n.

x a x 2 (1 + x 2 ) n. Limits and continuity Suppose that we have a function f : R R. Let a R. We say that f(x) tends to the limit l as x tends to a; lim f(x) = l ; x a if, given any real number ɛ > 0, there exists a real number

More information

HOMEWORK 5 SOLUTIONS. n!f n (1) lim. ln x n! + xn x. 1 = G n 1 (x). (2) k + 1 n. (n 1)!

HOMEWORK 5 SOLUTIONS. n!f n (1) lim. ln x n! + xn x. 1 = G n 1 (x). (2) k + 1 n. (n 1)! Math 7 Fall 205 HOMEWORK 5 SOLUTIONS Problem. 2008 B2 Let F 0 x = ln x. For n 0 and x > 0, let F n+ x = 0 F ntdt. Evaluate n!f n lim n ln n. By directly computing F n x for small n s, we obtain the following

More information

Lecture L3 - Vectors, Matrices and Coordinate Transformations

Lecture L3 - Vectors, Matrices and Coordinate Transformations S. Widnall 16.07 Dynamics Fall 2009 Lecture notes based on J. Peraire Version 2.0 Lecture L3 - Vectors, Matrices and Coordinate Transformations By using vectors and defining appropriate operations between

More information

E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................

More information

The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1].

The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1]. Probability Theory Probability Spaces and Events Consider a random experiment with several possible outcomes. For example, we might roll a pair of dice, flip a coin three times, or choose a random real

More information

Notes on Determinant

Notes on Determinant ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without

More information

The Ergodic Theorem and randomness

The Ergodic Theorem and randomness The Ergodic Theorem and randomness Peter Gács Department of Computer Science Boston University March 19, 2008 Peter Gács (Boston University) Ergodic theorem March 19, 2008 1 / 27 Introduction Introduction

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

More information

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d). 1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

THREE DIMENSIONAL GEOMETRY

THREE DIMENSIONAL GEOMETRY Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,

More information

Scalar Valued Functions of Several Variables; the Gradient Vector

Scalar Valued Functions of Several Variables; the Gradient Vector Scalar Valued Functions of Several Variables; the Gradient Vector Scalar Valued Functions vector valued function of n variables: Let us consider a scalar (i.e., numerical, rather than y = φ(x = φ(x 1,

More information

Basic Concepts of Point Set Topology Notes for OU course Math 4853 Spring 2011

Basic Concepts of Point Set Topology Notes for OU course Math 4853 Spring 2011 Basic Concepts of Point Set Topology Notes for OU course Math 4853 Spring 2011 A. Miller 1. Introduction. The definitions of metric space and topological space were developed in the early 1900 s, largely

More information

Lecture 13 - Basic Number Theory.

Lecture 13 - Basic Number Theory. Lecture 13 - Basic Number Theory. Boaz Barak March 22, 2010 Divisibility and primes Unless mentioned otherwise throughout this lecture all numbers are non-negative integers. We say that A divides B, denoted

More information

F. ABTAHI and M. ZARRIN. (Communicated by J. Goldstein)

F. ABTAHI and M. ZARRIN. (Communicated by J. Goldstein) Journal of Algerian Mathematical Society Vol. 1, pp. 1 6 1 CONCERNING THE l p -CONJECTURE FOR DISCRETE SEMIGROUPS F. ABTAHI and M. ZARRIN (Communicated by J. Goldstein) Abstract. For 2 < p

More information

(Basic definitions and properties; Separation theorems; Characterizations) 1.1 Definition, examples, inner description, algebraic properties

(Basic definitions and properties; Separation theorems; Characterizations) 1.1 Definition, examples, inner description, algebraic properties Lecture 1 Convex Sets (Basic definitions and properties; Separation theorems; Characterizations) 1.1 Definition, examples, inner description, algebraic properties 1.1.1 A convex set In the school geometry

More information

Linear Algebra I. Ronald van Luijk, 2012

Linear Algebra I. Ronald van Luijk, 2012 Linear Algebra I Ronald van Luijk, 2012 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents 1. Vector spaces 3 1.1. Examples 3 1.2. Fields 4 1.3. The field of complex numbers. 6 1.4.

More information

3. Mathematical Induction

3. Mathematical Induction 3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)

More information

Markov random fields and Gibbs measures

Markov random fields and Gibbs measures Chapter Markov random fields and Gibbs measures 1. Conditional independence Suppose X i is a random element of (X i, B i ), for i = 1, 2, 3, with all X i defined on the same probability space (.F, P).

More information

U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009. Notes on Algebra

U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009. Notes on Algebra U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009 Notes on Algebra These notes contain as little theory as possible, and most results are stated without proof. Any introductory

More information

Imprecise probabilities, bets and functional analytic methods in Łukasiewicz logic.

Imprecise probabilities, bets and functional analytic methods in Łukasiewicz logic. Imprecise probabilities, bets and functional analytic methods in Łukasiewicz logic. Martina Fedel joint work with K.Keimel,F.Montagna,W.Roth Martina Fedel (UNISI) 1 / 32 Goal The goal of this talk is to

More information

MA651 Topology. Lecture 6. Separation Axioms.

MA651 Topology. Lecture 6. Separation Axioms. MA651 Topology. Lecture 6. Separation Axioms. This text is based on the following books: Fundamental concepts of topology by Peter O Neil Elements of Mathematics: General Topology by Nicolas Bourbaki Counterexamples

More information

MATH 425, PRACTICE FINAL EXAM SOLUTIONS.

MATH 425, PRACTICE FINAL EXAM SOLUTIONS. MATH 45, PRACTICE FINAL EXAM SOLUTIONS. Exercise. a Is the operator L defined on smooth functions of x, y by L u := u xx + cosu linear? b Does the answer change if we replace the operator L by the operator

More information

Finite dimensional topological vector spaces

Finite dimensional topological vector spaces Chapter 3 Finite dimensional topological vector spaces 3.1 Finite dimensional Hausdorff t.v.s. Let X be a vector space over the field K of real or complex numbers. We know from linear algebra that the

More information

THE DYING FIBONACCI TREE. 1. Introduction. Consider a tree with two types of nodes, say A and B, and the following properties:

THE DYING FIBONACCI TREE. 1. Introduction. Consider a tree with two types of nodes, say A and B, and the following properties: THE DYING FIBONACCI TREE BERNHARD GITTENBERGER 1. Introduction Consider a tree with two types of nodes, say A and B, and the following properties: 1. Let the root be of type A.. Each node of type A produces

More information

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH ZACHARY ABEL 1. Introduction In this survey we discuss properties of the Higman-Sims graph, which has 100 vertices, 1100 edges, and is 22 regular. In fact

More information

2.3 Convex Constrained Optimization Problems

2.3 Convex Constrained Optimization Problems 42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

More information

So let us begin our quest to find the holy grail of real analysis.

So let us begin our quest to find the holy grail of real analysis. 1 Section 5.2 The Complete Ordered Field: Purpose of Section We present an axiomatic description of the real numbers as a complete ordered field. The axioms which describe the arithmetic of the real numbers

More information

MINIMIZATION OF ENTROPY FUNCTIONALS UNDER MOMENT CONSTRAINTS. denote the family of probability density functions g on X satisfying

MINIMIZATION OF ENTROPY FUNCTIONALS UNDER MOMENT CONSTRAINTS. denote the family of probability density functions g on X satisfying MINIMIZATION OF ENTROPY FUNCTIONALS UNDER MOMENT CONSTRAINTS I. Csiszár (Budapest) Given a σ-finite measure space (X, X, µ) and a d-tuple ϕ = (ϕ 1,..., ϕ d ) of measurable functions on X, for a = (a 1,...,

More information

FINITE DIMENSIONAL ORDERED VECTOR SPACES WITH RIESZ INTERPOLATION AND EFFROS-SHEN S UNIMODULARITY CONJECTURE AARON TIKUISIS

FINITE DIMENSIONAL ORDERED VECTOR SPACES WITH RIESZ INTERPOLATION AND EFFROS-SHEN S UNIMODULARITY CONJECTURE AARON TIKUISIS FINITE DIMENSIONAL ORDERED VECTOR SPACES WITH RIESZ INTERPOLATION AND EFFROS-SHEN S UNIMODULARITY CONJECTURE AARON TIKUISIS Abstract. It is shown that, for any field F R, any ordered vector space structure

More information

Set theory as a foundation for mathematics

Set theory as a foundation for mathematics V I I I : Set theory as a foundation for mathematics This material is basically supplementary, and it was not covered in the course. In the first section we discuss the basic axioms of set theory and the

More information

INCIDENCE-BETWEENNESS GEOMETRY

INCIDENCE-BETWEENNESS GEOMETRY INCIDENCE-BETWEENNESS GEOMETRY MATH 410, CSUSM. SPRING 2008. PROFESSOR AITKEN This document covers the geometry that can be developed with just the axioms related to incidence and betweenness. The full

More information

How To Find Out How To Calculate A Premeasure On A Set Of Two-Dimensional Algebra

How To Find Out How To Calculate A Premeasure On A Set Of Two-Dimensional Algebra 54 CHAPTER 5 Product Measures Given two measure spaces, we may construct a natural measure on their Cartesian product; the prototype is the construction of Lebesgue measure on R 2 as the product of Lebesgue

More information

THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS

THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS KEITH CONRAD 1. Introduction The Fundamental Theorem of Algebra says every nonconstant polynomial with complex coefficients can be factored into linear

More information

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

More information

n k=1 k=0 1/k! = e. Example 6.4. The series 1/k 2 converges in R. Indeed, if s n = n then k=1 1/k, then s 2n s n = 1 n + 1 +...

n k=1 k=0 1/k! = e. Example 6.4. The series 1/k 2 converges in R. Indeed, if s n = n then k=1 1/k, then s 2n s n = 1 n + 1 +... 6 Series We call a normed space (X, ) a Banach space provided that every Cauchy sequence (x n ) in X converges. For example, R with the norm = is an example of Banach space. Now let (x n ) be a sequence

More information

Follow links for Class Use and other Permissions. For more information send email to: permissions@pupress.princeton.edu

Follow links for Class Use and other Permissions. For more information send email to: permissions@pupress.princeton.edu COPYRIGHT NOTICE: Ariel Rubinstein: Lecture Notes in Microeconomic Theory is published by Princeton University Press and copyrighted, c 2006, by Princeton University Press. All rights reserved. No part

More information

FUNCTIONAL ANALYSIS LECTURE NOTES CHAPTER 2. OPERATORS ON HILBERT SPACES

FUNCTIONAL ANALYSIS LECTURE NOTES CHAPTER 2. OPERATORS ON HILBERT SPACES FUNCTIONAL ANALYSIS LECTURE NOTES CHAPTER 2. OPERATORS ON HILBERT SPACES CHRISTOPHER HEIL 1. Elementary Properties and Examples First recall the basic definitions regarding operators. Definition 1.1 (Continuous

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

What is Linear Programming?

What is Linear Programming? Chapter 1 What is Linear Programming? An optimization problem usually has three essential ingredients: a variable vector x consisting of a set of unknowns to be determined, an objective function of x to

More information

Metric Spaces. Lecture Notes and Exercises, Fall 2015. M.van den Berg

Metric Spaces. Lecture Notes and Exercises, Fall 2015. M.van den Berg Metric Spaces Lecture Notes and Exercises, Fall 2015 M.van den Berg School of Mathematics University of Bristol BS8 1TW Bristol, UK mamvdb@bristol.ac.uk 1 Definition of a metric space. Let X be a set,

More information

3. Prime and maximal ideals. 3.1. Definitions and Examples.

3. Prime and maximal ideals. 3.1. Definitions and Examples. COMMUTATIVE ALGEBRA 5 3.1. Definitions and Examples. 3. Prime and maximal ideals Definition. An ideal P in a ring A is called prime if P A and if for every pair x, y of elements in A\P we have xy P. Equivalently,

More information

DIFFERENTIABILITY OF COMPLEX FUNCTIONS. Contents

DIFFERENTIABILITY OF COMPLEX FUNCTIONS. Contents DIFFERENTIABILITY OF COMPLEX FUNCTIONS Contents 1. Limit definition of a derivative 1 2. Holomorphic functions, the Cauchy-Riemann equations 3 3. Differentiability of real functions 5 4. A sufficient condition

More information

8.1 Examples, definitions, and basic properties

8.1 Examples, definitions, and basic properties 8 De Rham cohomology Last updated: May 21, 211. 8.1 Examples, definitions, and basic properties A k-form ω Ω k (M) is closed if dω =. It is exact if there is a (k 1)-form σ Ω k 1 (M) such that dσ = ω.

More information

Unified Lecture # 4 Vectors

Unified Lecture # 4 Vectors Fall 2005 Unified Lecture # 4 Vectors These notes were written by J. Peraire as a review of vectors for Dynamics 16.07. They have been adapted for Unified Engineering by R. Radovitzky. References [1] Feynmann,

More information

The Heat Equation. Lectures INF2320 p. 1/88

The Heat Equation. Lectures INF2320 p. 1/88 The Heat Equation Lectures INF232 p. 1/88 Lectures INF232 p. 2/88 The Heat Equation We study the heat equation: u t = u xx for x (,1), t >, (1) u(,t) = u(1,t) = for t >, (2) u(x,) = f(x) for x (,1), (3)

More information

God created the integers and the rest is the work of man. (Leopold Kronecker, in an after-dinner speech at a conference, Berlin, 1886)

God created the integers and the rest is the work of man. (Leopold Kronecker, in an after-dinner speech at a conference, Berlin, 1886) Chapter 2 Numbers God created the integers and the rest is the work of man. (Leopold Kronecker, in an after-dinner speech at a conference, Berlin, 1886) God created the integers and the rest is the work

More information

EMBEDDING COUNTABLE PARTIAL ORDERINGS IN THE DEGREES

EMBEDDING COUNTABLE PARTIAL ORDERINGS IN THE DEGREES EMBEDDING COUNTABLE PARTIAL ORDERINGS IN THE ENUMERATION DEGREES AND THE ω-enumeration DEGREES MARIYA I. SOSKOVA AND IVAN N. SOSKOV 1. Introduction One of the most basic measures of the complexity of a

More information

Stochastic Inventory Control

Stochastic Inventory Control Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the

More information

6. Define log(z) so that π < I log(z) π. Discuss the identities e log(z) = z and log(e w ) = w.

6. Define log(z) so that π < I log(z) π. Discuss the identities e log(z) = z and log(e w ) = w. hapter omplex integration. omplex number quiz. Simplify 3+4i. 2. Simplify 3+4i. 3. Find the cube roots of. 4. Here are some identities for complex conjugate. Which ones need correction? z + w = z + w,

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information