An Introduction to the Wavelet Analysis of Time Series

Don Percival

Applied Physics Lab, University of Washington, Seattle
Dept. of Statistics, University of Washington, Seattle
MathSoft, Inc., Seattle
Overview

- wavelets are analysis tools mainly for
  - time series analysis (focus of this tutorial)
  - image analysis (will not cover)
- as a subject, wavelets are relatively new (1983 to present)
  - synthesis of many new/old ideas
  - keyword in 10,558+ articles & books since 1989 (2000+ in the last year alone)
- broadly speaking, there have been two waves of wavelets
  - continuous wavelet transform (1983 and on)
  - discrete wavelet transform (1988 and on)
Game Plan

- introduce the subject via the CWT
- describe the DWT and its main products
  - multiresolution analysis (additive decomposition)
  - analysis of variance (power decomposition)
- describe selected uses for the DWT
  - wavelet variance (related to the Allan variance)
  - decorrelation of fractionally differenced processes (closely related to power law processes)
  - signal extraction (denoising)
What is a Wavelet?

- a wavelet is a "small wave" (sinusoids are "big waves")
- real-valued ψ(t) is a wavelet if
  1. the integral of ψ(t) is zero: ∫ ψ(t) dt = 0
  2. the integral of ψ²(t) is unity: ∫ ψ²(t) dt = 1 (called the unit energy property)
- wavelets so defined deserve their name because
  - #2 says that, for every small ε > 0, we have ∫_{−T}^{T} ψ²(t) dt > 1 − ε for some finite T (which might be quite large!)
  - the length of [−T, T] is small compared to (−∞, ∞)
  - #2 says ψ(t) must be nonzero somewhere
  - #1 says ψ(t) balances itself above/below 0
- Fig. 1: three wavelets
- Fig. 2: examples of complex-valued wavelets
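- a minimal NumPy sketch (my own illustration, not from the original slides) that checks these two defining properties numerically for the unit-energy Haar wavelet introduced later in the tutorial:

```python
import numpy as np

def psi_haar(t):
    """Unit-energy Haar wavelet: -1/sqrt(2) on [-1,0), +1/sqrt(2) on [0,1), 0 elsewhere."""
    return np.where((t >= -1) & (t < 0), -1 / np.sqrt(2),
                    np.where((t >= 0) & (t < 1), 1 / np.sqrt(2), 0.0))

t = np.linspace(-3, 3, 600001)          # fine grid for Riemann sums
dt = t[1] - t[0]
print(np.sum(psi_haar(t)) * dt)         # property 1: integral of psi(t) is ~0
print(np.sum(psi_haar(t) ** 2) * dt)    # property 2: integral of psi^2(t) is ~1
```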
Basics of Wavelet Analysis: I

- wavelets tell us about variations in local averages
- to quantify this description, let x(t) be a signal
  - real-valued function of t
  - will refer to t as time (but it can be, e.g., depth)
- consider the average value of x(t) over [a, b]:
  α(a, b) ≡ (1/(b − a)) ∫_a^b x(u) du
- reparameterize in terms of λ ≡ b − a and t ≡ (a + b)/2:
  A(λ, t) ≡ α(t − λ/2, t + λ/2) = (1/λ) ∫_{t−λ/2}^{t+λ/2} x(u) du
  - λ = b − a is called the scale
  - t = (a + b)/2 is the center time of the interval
  - A(λ, t) is the average value of x(t) over scale λ centered at time t
Basics of Wavelet Analysis: II

- average values of signals are of wide-spread interest
  - hourly rainfall rates
  - monthly mean sea surface temperatures
  - yearly average temperatures over central England
  - etc., etc., etc. (Rogers & Hammerstein, 1951)
- Fig. 3: fractional frequency deviates in clock 571
  - can regard as averages of the form A(1, t) over [t − 1/2, t + 1/2]
  - t is measured in days (one measurement per day)
  - plot shows A(1, t) versus integer t
  - A(1, t) = 0: master clock & clock 571 agree perfectly
  - A(1, t) < 0: clock 571 is losing time
  - can easily correct if A(1, t) is constant
  - quality of the clock is related to changes in A(1, t)
Basics of Wavelet Analysis: III

- can quantify changes in A(1, t) via
  D(1, t − 1/2) ≡ A(1, t) − A(1, t − 1) = ∫_{t−1/2}^{t+1/2} x(u) du − ∫_{t−3/2}^{t−1/2} x(u) du
  or, equivalently,
  D(1, t) ≡ A(1, t + 1/2) − A(1, t − 1/2) = ∫_t^{t+1} x(u) du − ∫_{t−1}^t x(u) du
- generalizing to scales other than unity yields
  D(λ, t) ≡ A(λ, t + λ/2) − A(λ, t − λ/2) = (1/λ) ∫_t^{t+λ} x(u) du − (1/λ) ∫_{t−λ}^t x(u) du
- D(λ, t) is often of more interest than A(λ, t)
- can connect to the Haar wavelet: write D(λ, t) = ∫ ψ_{λ,t}(u) x(u) du with
  ψ_{λ,t}(u) ≡ { −1/λ, t − λ ≤ u < t;  1/λ, t ≤ u < t + λ;  0, otherwise }
Basics of Wavelet Analysis: IV

- specialize to the case λ = 1 and t = 0:
  ψ_{1,0}(u) = { −1, −1 ≤ u < 0;  1, 0 ≤ u < 1;  0, otherwise }
- comparison to the unit-energy Haar wavelet ψ^H(u) yields ψ_{1,0}(u) = √2 ψ^H(u)
- the Haar wavelet "mines out" info on the difference between unit scale averages at t = 0 via
  W^H(1, 0) ≡ ∫ ψ^H(u) x(u) du
- to mine out info at other t's, just shift ψ^H(u):
  ψ^H_{1,t}(u) ≡ ψ^H(u − t) = { −1/√2, t − 1 ≤ u < t;  1/√2, t ≤ u < t + 1;  0, otherwise }
  (Fig. 4: top row of plots)
- to mine out info about other λ's, form
  ψ^H_{λ,t}(u) ≡ (1/√λ) ψ^H((u − t)/λ) = { −1/√(2λ), t − λ ≤ u < t;  1/√(2λ), t ≤ u < t + λ;  0, otherwise }
  (Fig. 4: bottom row of plots)
Basics of Wavelet Analysis: V

- can check that ψ^H_{λ,t}(u) is a wavelet for all λ & t
- use ψ^H_{λ,t}(u) to obtain
  W^H(λ, t) ≡ ∫ ψ^H_{λ,t}(u) x(u) du ∝ D(λ, t)
- left-hand side is the Haar CWT
- can do the same with other wavelets:
  W(λ, t) ≡ ∫ ψ_{λ,t}(u) x(u) du, where ψ_{λ,t}(u) ≡ (1/√λ) ψ((u − t)/λ)
- left-hand side is the CWT based on ψ(u)
- interpretation for ψ^{fdG}(u) and ψ^{Mh}(u) (Fig. 1): differences of adjacent weighted averages
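- a small sketch (my own illustration; the toy signal and grid are arbitrary choices) of the Haar CWT computed by brute-force numerical integration of the formula above:

```python
import numpy as np

u = np.linspace(0, 16, 16001)            # fine time grid for numerical integration
du = u[1] - u[0]
x = np.sin(2 * np.pi * u / 4)            # toy signal with period 4

def haar_cwt(x, u, lam, t):
    """W^H(lam, t) = integral of psi^H_{lam,t}(u) x(u) du for the Haar wavelet."""
    psi = np.zeros_like(u)
    psi[(u >= t - lam) & (u < t)] = -1 / np.sqrt(2 * lam)
    psi[(u >= t) & (u < t + lam)] = 1 / np.sqrt(2 * lam)
    return np.sum(psi * x) * du

for lam in (1.0, 2.0, 4.0):              # a few scales, all at center time t = 8
    print(lam, haar_cwt(x, u, lam, t=8.0))
```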
Basics of Wavelet Analysis: VI

- basic CWT result: if ψ(u) satisfies an admissibility condition, can recover x(t) from its CWT:
  x(t) = (1/C_ψ) ∫_0^∞ [ ∫_{−∞}^{∞} W(λ, u) (1/√λ) ψ((t − u)/λ) du ] dλ/λ²,
  where C_ψ is a constant depending just on ψ
- conclusion: W(λ, t) is equivalent to x(t)
- can also show that
  ∫ x²(t) dt = (1/C_ψ) ∫_0^∞ [ ∫_{−∞}^{∞} W²(λ, t) dt ] dλ/λ²
  - LHS is called the energy in x(t)
  - RHS integrand is an energy density over λ & t
- Fig. 3: Mexican hat CWT of clock 571 data
Beyond the CWT: the DWT

- critique: we have transformed a signal into an image
  - can often get by with subsamples of W(λ, t)
  - leads to the notion of the discrete wavelet transform (DWT)
  - can regard as dyadic "slices" through the CWT
  - can further subsample the slices at various t's
- DWT has appeal in its own right
  - most time series are sampled as discrete values (can be tricky to implement the CWT)
  - can formulate as an orthonormal transform (facilitates statistical analysis)
  - approximately decorrelates certain time series (including power law processes)
  - standardization to dyadic scales is often adequate
  - can be faster than the fast Fourier transform!
- will concentrate on the DWT for the remainder of the tutorial
Overview of DWT

- let X = [X_0, X_1, ..., X_{N−1}]^T be an observed time series (for convenience, assume N is an integer multiple of 2^{J_0})
- let 𝒲 be an N × N orthonormal DWT matrix
- W = 𝒲X is the vector of DWT coefficients
- orthonormality says X = 𝒲^T W, so ||X||² = ||W||²
- can partition W as follows:
  W = [W_1; W_2; ...; W_{J_0}; V_{J_0}]  (stacked subvectors)
- W_j contains N_j = N/2^j wavelet coefficients
  - related to changes of averages at scale τ_j ≡ 2^{j−1} (τ_j is the jth "dyadic scale")
  - related to times spaced 2^j units apart
- V_{J_0} contains N_{J_0} = N/2^{J_0} scaling coefficients
  - related to averages at scale λ_{J_0} ≡ 2^{J_0}
  - related to times spaced 2^{J_0} units apart
Example: Haar DWT

- Fig. 5: 𝒲 for the Haar DWT with N = 16
  - first 8 rows yield W_1: changes on scale 1
  - next 4 rows yield W_2: changes on scale 2
  - next 2 rows yield W_3: changes on scale 4
  - next to last row yields W_4: change on scale 8
  - last row yields V_4: average on scale 16
- Fig. 6: Haar DWT coefficients for clock 571
DWT in Terms of Filters

- filter X_0, X_1, ..., X_{N−1} to obtain
  2^{j/2} W̃_{j,t} ≡ Σ_{l=0}^{L_j−1} h_{j,l} X_{t−l mod N},   t = 0, 1, ..., N−1,
  where h_{j,l} is the jth level wavelet filter (note: circular filtering)
- subsample to obtain the wavelet coefficients:
  W_{j,t} = 2^{j/2} W̃_{j, 2^j(t+1)−1},   t = 0, 1, ..., N_j − 1,
  where W_{j,t} is the tth element of W_j
- Figs. 7 & 8: Haar, D(4), C(6) & LA(8) wavelet filters
- the jth wavelet filter is band-pass with pass-band [1/2^{j+1}, 1/2^j]
  - note: jth scale is related to an interval of frequencies
- similarly, scaling filters yield V_{J_0}
- Figs. 9 & 10: Haar, D(4), C(6) & LA(8) scaling filters
- the J_0th scaling filter is low-pass with pass-band [0, 1/2^{J_0+1}]
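- a minimal sketch (my own illustration, not from the slides) of level j = 1 for a toy series: circularly filter with the unit-scale Haar wavelet filter, then keep every other output, exactly as in the subsampling formula above:

```python
import numpy as np

X = np.arange(16.0)                                  # toy series, N = 16
h = np.array([1 / np.sqrt(2), -1 / np.sqrt(2)])      # unit-scale Haar wavelet filter

N = X.size
filtered = np.array([sum(h[l] * X[(t - l) % N] for l in range(h.size))
                     for t in range(N)])             # circular filtering
W1 = filtered[1::2]                                  # subsample at t = 2(t'+1)-1 = 1, 3, 5, ...
print(W1)                                            # N/2 = 8 level-1 wavelet coefficients
```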
Pyramid Algorithm: I

- can formulate the DWT via a "pyramid" algorithm
  - elegant iterative algorithm for computing the DWT
  - implicitly defines 𝒲
  - computes W = 𝒲X using O(N) multiplications
    - brute force method uses O(N²)
    - FFT algorithm uses O(N log₂ N)
- the algorithm makes use of two basic filters
  - wavelet filter h_l of unit scale: h_l ≡ h_{1,l}
  - associated scaling filter g_l
The Wavelet Filter: I

- let h_l, l = 0, ..., L−1, be a real-valued filter
  - L is the filter width, so h_0 ≠ 0 & h_{L−1} ≠ 0
  - L must be even
  - assume h_l = 0 for l < 0 & l ≥ L
- h_l is called a wavelet filter if it has these 3 properties
  1. summation to zero:  Σ_{l=0}^{L−1} h_l = 0
  2. unit energy:  Σ_{l=0}^{L−1} h_l² = 1
  3. orthogonality to even shifts:
     Σ_{l=0}^{L−1} h_l h_{l+2n} = Σ_{l=−∞}^{∞} h_l h_{l+2n} = 0 for all nonzero integers n
- properties 2 & 3 together are called the orthonormality property
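- a short sketch checking the three properties numerically; the D(4) coefficients below are the standard Daubechies values (an assumption on my part, since the slides only show these filters in Figs. 7 & 8):

```python
import numpy as np

s = np.sqrt(3.0)
h = np.array([1 - s, s - 3, 3 + s, -(1 + s)]) / (4 * np.sqrt(2))   # D(4) wavelet filter

def even_shift_dot(h, n):
    """sum over l of h_l * h_{l+2n} (terms outside the filter are zero)."""
    return sum(h[l] * h[l + 2 * n] for l in range(len(h) - 2 * n))

print(h.sum())                  # property 1: sums to ~0
print(np.sum(h ** 2))           # property 2: unit energy, ~1
print(even_shift_dot(h, 1))     # property 3: ~0 (larger n have no overlapping terms when L = 4)
```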
The Wavelet Filter: II

- transfer & squared gain functions for h_l:
  H(f) ≡ Σ_{l=0}^{L−1} h_l e^{−i2πfl}  &  ℋ(f) ≡ |H(f)|²
- can argue that the orthonormality property is equivalent to
  ℋ(f) + ℋ(f + 1/2) = 2 for all f
- Fig. 11: ℋ(f) for Daubechies wavelet filters
  - L = 2 case is the Haar wavelet filter
  - filter cascade with averaging & differencing filters
  - high-pass filter with pass-band [1/4, 1/2]
  - can regard as a half-band filter
The Scaling Filter: I

- scaling filter: g_l ≡ (−1)^{l+1} h_{L−1−l}
  - reverse h_l & flip the sign of every other coefficient
  - e.g.: h_0 = 1/√2 & h_1 = −1/√2  →  g_0 = g_1 = 1/√2
  - g_l is the quadrature mirror filter for h_l
- properties of h_l imply that g_l has these properties:
  1. summation to ±√2, so will assume  Σ_{l=0}^{L−1} g_l = √2
  2. unit energy:  Σ_{l=0}^{L−1} g_l² = 1
  3. orthogonality to even shifts:
     Σ_{l=0}^{L−1} g_l g_{l+2n} = Σ_{l=−∞}^{∞} g_l g_{l+2n} = 0 for all nonzero integers n
  4. orthogonality to the wavelet filter at even shifts:
     Σ_{l=0}^{L−1} g_l h_{l+2n} = Σ_{l=−∞}^{∞} g_l h_{l+2n} = 0 for all integers n
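- a quick sketch (mine) deriving g_l from h_l via the quadrature mirror relation and confirming some of the properties above, again using the D(4) wavelet filter as an assumed example:

```python
import numpy as np

s = np.sqrt(3.0)
h = np.array([1 - s, s - 3, 3 + s, -(1 + s)]) / (4 * np.sqrt(2))   # D(4) wavelet filter
L = h.size
g = np.array([(-1) ** (l + 1) * h[L - 1 - l] for l in range(L)])   # g_l = (-1)^{l+1} h_{L-1-l}

print(g.sum(), np.sqrt(2))      # property 1: sums to sqrt(2) (low-pass)
print(np.sum(g ** 2))           # property 2: unit energy
print(np.sum(g * h))            # property 4 with n = 0: orthogonal to the wavelet filter
```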
The Scaling Filter: II

- transfer & squared gain functions for g_l:
  G(f) ≡ Σ_{l=0}^{L−1} g_l e^{−i2πfl}  &  𝒢(f) ≡ |G(f)|²
- can argue that 𝒢(f) = ℋ(f − 1/2)
- have 𝒢(0) = ℋ(−1/2) = ℋ(1/2)  &  𝒢(1/2) = ℋ(0)
- since h_l is high-pass, g_l must be low-pass
  - low-pass filter with pass-band [0, 1/4]
  - can also regard as a half-band filter
- orthonormality property equivalent to
  𝒢(f) + 𝒢(f + 1/2) = 2  or  ℋ(f) + 𝒢(f) = 2 for all f
Pyramid Algorithm: II

- define V_0 ≡ X and set j = 1
- input to the jth stage of the pyramid algorithm is V_{j−1}
  - V_{j−1} is "full-band": related to frequencies [0, 1/2^j] in X
- filter with the half-band filters and downsample:
  W_{j,t} ≡ Σ_{l=0}^{L−1} h_l V_{j−1, 2t+1−l mod N_{j−1}}
  V_{j,t} ≡ Σ_{l=0}^{L−1} g_l V_{j−1, 2t+1−l mod N_{j−1}},   t = 0, ..., N_j − 1
- place these in the vectors W_j & V_j
  - W_j are the wavelet coefficients for scale τ_j = 2^{j−1}
  - V_j are the scaling coefficients for scale λ_j = 2^j
- increment j and repeat the above until j = J_0
- yields the DWT coefficients W_1, ..., W_{J_0}, V_{J_0}
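- a compact sketch (my own, not the slides') of the full pyramid algorithm under the conventions above, using the Haar filter pair purely as an example; the final print checks that the transform preserves energy, as discussed two slides below:

```python
import numpy as np

def dwt_pyramid(X, h, g, J0):
    """Return the wavelet coefficient vectors W_1,...,W_J0 and the scaling vector V_J0."""
    V = np.asarray(X, dtype=float)
    Ws = []
    for _ in range(J0):
        N = V.size
        W_j = np.zeros(N // 2)
        V_j = np.zeros(N // 2)
        for t in range(N // 2):
            for l in range(h.size):
                W_j[t] += h[l] * V[(2 * t + 1 - l) % N]   # wavelet (high-pass) branch
                V_j[t] += g[l] * V[(2 * t + 1 - l) % N]   # scaling (low-pass) branch
        Ws.append(W_j)
        V = V_j
    return Ws, V

h = np.array([1, -1]) / np.sqrt(2)        # unit-scale Haar wavelet filter
g = np.array([1, 1]) / np.sqrt(2)         # unit-scale Haar scaling filter
X = np.random.randn(16)                   # toy series with N = 16 = 2^4
Ws, V4 = dwt_pyramid(X, h, g, J0=4)
print(np.sum(X ** 2), sum(np.sum(W ** 2) for W in Ws) + np.sum(V4 ** 2))   # equal
```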
Pyramid Algorithm: III

- can formulate an inverse pyramid algorithm (recovers V_{j−1} from W_j and V_j)
- the algorithm implicitly defines the transform matrix 𝒲
- partition 𝒲 commensurate with W:
  𝒲 = [𝒲_1; 𝒲_2; ...; 𝒲_{J_0}; 𝒱_{J_0}]   parallels   W = [W_1; W_2; ...; W_{J_0}; V_{J_0}]
- rows of 𝒲_j use the jth level wavelet filter h_{j,l}, whose DFT is
  H_j(f) = H(2^{j−1} f) Π_{l=0}^{j−2} G(2^l f)
  (h_{j,l} has L_j ≡ (2^j − 1)(L − 1) + 1 nonzero elements)
- 𝒲_j is an N_j × N matrix such that W_j = 𝒲_j X and 𝒲_j 𝒲_j^T = I_{N_j}
Two Consequences of Orthonormality

- multiresolution analysis (MRA):
  X = 𝒲^T W = Σ_{j=1}^{J_0} 𝒲_j^T W_j + 𝒱_{J_0}^T V_{J_0} ≡ Σ_{j=1}^{J_0} D_j + S_{J_0}
  - a scale-based additive decomposition
  - the D_j's & S_{J_0} are called details & smooth
- analysis of variance:
  - consider the energy in the time series: ||X||² = X^T X = Σ_{t=0}^{N−1} X_t²
  - energy is preserved in the DWT coefficients:
    ||W||² = ||𝒲X||² = X^T 𝒲^T 𝒲 X = X^T X = ||X||²
  - since W_1, ..., W_{J_0}, V_{J_0} partitions W, have
    ||W||² = Σ_{j=1}^{J_0} ||W_j||² + ||V_{J_0}||²,
  - leading to an analysis of the sample variance:
    σ̂²_X ≡ (1/N) Σ_{t=0}^{N−1} (X_t − X̄)² = (1/N) Σ_{j=1}^{J_0} ||W_j||² + (1/N) ||V_{J_0}||² − X̄²
  - a scale-based decomposition (cf. frequency-based)
Variation: Maximal Overlap DWT

- can eliminate downsampling and use
  W̃_{j,t} ≡ (1/2^{j/2}) Σ_{l=0}^{L_j−1} h_{j,l} X_{t−l mod N},   t = 0, 1, ..., N−1,
  to define the MODWT coefficients W̃_j (& also Ṽ_j)
- unlike the DWT, the MODWT is not orthonormal (in fact the MODWT is highly redundant)
- like the DWT, can do an MRA & an analysis of variance:
  ||X||² = Σ_{j=1}^{J_0} ||W̃_j||² + ||Ṽ_{J_0}||²
- unlike the DWT, the MODWT works for all sample sizes N (i.e., the "power of 2" assumption is not required)
- if N is a power of 2, can compute the MODWT using O(N log₂ N) operations (i.e., same as the FFT algorithm)
  - contrast to the DWT, which uses O(N) operations
- Fig. 12: Haar MODWT coefficients for clock 571 (cf. Fig. 6 with DWT coefficients)
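- a hypothetical sketch of a MODWT pyramid algorithm (rescaled filters, no downsampling, filter taps spread out by 2^{j−1} at stage j); the Haar filters and N = 12 are arbitrary choices, and the final line checks the MODWT analysis of variance stated above:

```python
import numpy as np

def modwt(X, h, g, J0):
    """Return MODWT wavelet vectors W~_1,...,W~_J0 and the scaling vector V~_J0."""
    V = np.asarray(X, dtype=float)
    N = V.size
    ht, gt = h / np.sqrt(2), g / np.sqrt(2)      # MODWT filters: rescale by 1/sqrt(2)
    Ws = []
    for j in range(1, J0 + 1):
        Wj, Vj = np.zeros(N), np.zeros(N)
        for t in range(N):
            for l in range(h.size):
                k = (t - 2 ** (j - 1) * l) % N   # taps spaced 2^{j-1} apart, circularly
                Wj[t] += ht[l] * V[k]
                Vj[t] += gt[l] * V[k]
        Ws.append(Wj)
        V = Vj
    return Ws, V

h = np.array([1, -1]) / np.sqrt(2)
g = np.array([1, 1]) / np.sqrt(2)
X = np.random.randn(12)                          # N need not be a power of 2
Ws, VJ = modwt(X, h, g, J0=3)
print(np.sum(X ** 2), sum(np.sum(W ** 2) for W in Ws) + np.sum(VJ ** 2))   # equal
```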
Definition of Wavelet Variance

- let X_t, t = ..., −1, 0, 1, ..., be a stochastic process
- run X_t through the jth level (MODWT-normalized) wavelet filter:
  W̄_{j,t} ≡ (1/2^{j/2}) Σ_{l=0}^{L_j−1} h_{j,l} X_{t−l},   t = ..., −1, 0, 1, ...,
  which should be contrasted with
  W̃_{j,t} ≡ (1/2^{j/2}) Σ_{l=0}^{L_j−1} h_{j,l} X_{t−l mod N},   t = 0, 1, ..., N−1
- definition of the time dependent wavelet variance (also called the wavelet spectrum):
  ν²_{X,t}(τ_j) ≡ var{W̄_{j,t}},
  assuming var{W̄_{j,t}} exists and is finite
- ν²_{X,t}(τ_j) depends on τ_j and t
- will consider the time independent wavelet variance:
  ν²_X(τ_j) ≡ var{W̄_{j,t}}
  (can be easily adapted to the time varying situation)
Rationale for Wavelet Variance

- decomposes the variance on a scale-by-scale basis
- useful substitute/complement for the spectrum
- useful substitute for the process/sample variance
Variance Decomposition

- suppose X_t has power spectrum S_X(f):
  ∫_{−1/2}^{1/2} S_X(f) df = var{X_t};
  i.e., decomposes var{X_t} across frequencies f
  - involves an uncountably infinite number of f's
  - S_X(f) df ≈ contribution to var{X_t} due to f's in an interval of length df centered at f
- wavelet variance analog to this fundamental result:
  Σ_{j=1}^{∞} ν²_X(τ_j) = var{X_t};
  i.e., decomposes var{X_t} across scales τ_j
  - recall the DWT/MODWT analysis of the sample variance
  - involves a countably infinite number of τ_j's
  - ν²_X(τ_j) = contribution to var{X_t} due to scale τ_j
  - ν²_X(τ_j) has the same units as X_t² (easier to interpret)
Spectrum Substitute/Complement

- because h_{j,l} is band-pass over [1/2^{j+1}, 1/2^j],
  ν²_X(τ_j) ≈ 2 ∫_{1/2^{j+1}}^{1/2^j} S_X(f) df
- if S_X(f) is featureless, the info in ν²_X(τ_j) ≈ the info in S_X(f)
- ν²_X(τ_j) is more succinct: only 1 value per octave band
- example: S_X(f) ∝ |f|^α, i.e., a power law process
  - can deduce α from the slope of log S_X(f) vs. log f
  - implies ν²_X(τ_j) ∝ τ_j^{−α−1} approximately
  - can deduce α from the slope of log ν²_X(τ_j) vs. log τ_j
  - no loss of info using ν²_X(τ_j) rather than S_X(f)
- with the Haar wavelet, obtain the "pilot spectrum" estimate proposed in Blackman & Tukey (1958)
Substitute for Variance: I

- can be difficult to estimate the process variance
- ν²_X(τ_j) is a useful substitute: easy to estimate & finite
- let µ ≡ E{X_t} be known and σ² ≡ var{X_t} be unknown
- can estimate σ² using
  σ̃² ≡ (1/N) Σ_{t=0}^{N−1} (X_t − µ)²
- the estimator above is unbiased: E{σ̃²} = σ²
- if µ is unknown, can estimate σ² using
  σ̂² ≡ (1/N) Σ_{t=0}^{N−1} (X_t − X̄)²
- there is some (non-pathological!) X_t such that E{σ̂²} < ε σ² for any given ε > 0 & N ≥ 1
  - σ̂² can badly underestimate σ²!
  - example: power law process with −1 < α < 0
Substitute for Variance: II

- Q: why is the wavelet variance useful when σ² is not?
  - replaces "global" variability with variability over scales
- if X_t is stationary with mean µ, then
  E{W̄_{j,t}} = (1/2^{j/2}) Σ_{l=0}^{L_j−1} h_{j,l} E{X_{t−l}} = (µ/2^{j/2}) Σ_{l=0}^{L_j−1} h_{j,l} = 0
  because Σ_l h_{j,l} = 0
- E{W̄_{j,t}} is known, so can get an unbiased estimator of var{W̄_{j,t}} = ν²_X(τ_j)
- certain nonstationary X_t have a well-defined ν²_X(τ_j)
  - example: power law processes with α ≤ −1 (example of a process with stationary increments)
Estimation of Wavelet Variance: I

- can base an estimator on the MODWT of X_0, X_1, ..., X_{N−1}:
  W̃_{j,t} ≡ (1/2^{j/2}) Σ_{l=0}^{L_j−1} h_{j,l} X_{t−l mod N},   t = 0, 1, ..., N−1
  (a DWT-based estimator is possible, but less efficient)
- recall that
  W̄_{j,t} ≡ (1/2^{j/2}) Σ_{l=0}^{L_j−1} h_{j,l} X_{t−l},   t = 0, ±1, ±2, ...,
  so W̃_{j,t} = W̄_{j,t} if the "mod" is not needed: L_j − 1 ≤ t < N
- if N − L_j ≥ 0, an unbiased estimator of ν²_X(τ_j) is
  ν̂²_X(τ_j) ≡ (1/(N − L_j + 1)) Σ_{t=L_j−1}^{N−1} W̃²_{j,t} = (1/M_j) Σ_{t=L_j−1}^{N−1} W̃²_{j,t},
  where M_j ≡ N − L_j + 1
- can also construct a biased estimator of ν²_X(τ_j):
  (1/N) Σ_{t=0}^{N−1} W̃²_{j,t} = (1/N) ( Σ_{t=0}^{L_j−2} W̃²_{j,t} + Σ_{t=L_j−1}^{N−1} W̃²_{j,t} )
  - 1st sum in parentheses is influenced by circularity
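- a sketch of the unbiased estimator for the Haar case; writing out the level-j Haar MODWT filter explicitly is my own shortcut (not from the slides), and the toy series is unit-variance white noise, for which ν²_X(τ_j) = 1/2^j:

```python
import numpy as np

def haar_modwt_filter(j):
    """Level-j Haar MODWT wavelet filter: 2^{j-1} taps of 1/2^j then 2^{j-1} taps of -1/2^j."""
    half = 2 ** (j - 1)
    return np.concatenate([np.full(half, 1 / 2 ** j), np.full(half, -1 / 2 ** j)])

def wavelet_variance(X, j):
    """Unbiased MODWT-based estimate of nu^2_X(tau_j), Haar case."""
    N, h = X.size, haar_modwt_filter(j)
    Lj = h.size
    W = np.array([sum(h[l] * X[(t - l) % N] for l in range(Lj)) for t in range(N)])
    return np.sum(W[Lj - 1:] ** 2) / (N - Lj + 1)   # drop boundary-influenced coefficients

X = np.random.randn(512)                 # unit-variance white noise
for j in (1, 2, 3):
    print(j, wavelet_variance(X, j))     # roughly 1/2^j = 0.5, 0.25, 0.125
```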
Estimation of Wavelet Variance: II

- the biased estimator is in fact unbiased if {X_t} is white noise
- the biased estimator offers an exact analysis of σ̂²_X; the unbiased estimator need not
- the biased estimator can have a better mean square error (Greenhall et al., 1999; need to "reflect" X_t)
Statistical Properties of ν̂²_X(τ_j)

- suppose {W̄_{j,t}} is Gaussian with mean 0 & spectrum S_j(f)
- suppose a square integrability condition holds:
  A_j ≡ ∫_{−1/2}^{1/2} S_j²(f) df < ∞  &  S_j(f) > 0
  (holds for power law processes if L is large enough)
- can show that ν̂²_X(τ_j) is asymptotically normal with mean ν²_X(τ_j) & large sample variance 2A_j/M_j
- can estimate A_j and use it with ν̂²_X(τ_j) to construct a confidence interval for ν²_X(τ_j)
- example:
  - Fig. 13: clock errors X_t ≡ X_t^{(0)}, along with differences X_t^{(i)} ≡ X_t^{(i−1)} − X_{t−1}^{(i−1)} for i = 1, 2
  - Fig. 14: ν̂²_X(τ_j) for the clock errors
  - Fig. 15: ν̂²_Y(τ_j) for Y_t ≡ X_t^{(1)}
  - Haar ν̂²_Y(τ_j) is related to the Allan variance σ²_Y(2, τ_j):
    ν²_Y(τ_j) = ½ σ²_Y(2, τ_j)
Decorrelation of FD Processes

- X_t is fractionally differenced (FD) if its spectrum is
  S_X(f) = σ²_ε / |2 sin(πf)|^{2δ},
  where σ²_ε > 0 and −1/2 < δ < 1/2
- note: for small f, have S_X(f) ≈ C/|f|^{2δ}; i.e., a power law with α = −2δ
- if δ = 0, the FD process is white noise
- if 0 < δ < 1/2, the FD process is stationary with "long memory"
- can extend the definition to δ ≥ 1/2: a nonstationary "1/f-type" process
- also called an ARFIMA(0, δ, 0) process
- Fig. 16: DWT of a simulated FD process, δ = 0.4 (sample autocorrelation sequences (ACSs) on the right)
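- a tiny sketch (mine; the parameter values are illustrative) of the FD spectrum and its small-f power-law approximation with α = −2δ:

```python
import numpy as np

sigma2_eps, delta = 1.0, 0.4                                         # illustrative values
f = np.array([0.001, 0.01, 0.1])
S = sigma2_eps / np.abs(2 * np.sin(np.pi * f)) ** (2 * delta)        # exact FD spectrum
S_approx = sigma2_eps / np.abs(2 * np.pi * f) ** (2 * delta)         # small-f power law
print(S)
print(S_approx)       # close to S for small f; both blow up as f -> 0 (long memory)
```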
DWT as Whitening Transform

- the sample ACSs suggest the W_j are uncorrelated
- since the FD process is stationary, so are the W_j (ignoring terms influenced by circularity)
- Fig. 17: spectra for W_j, j = 1, 2, 3, 4
- W_j & W_{j'}, j ≠ j', are approximately uncorrelated (the approximation improves as L increases)
- the DWT thus acts as a whitening transform
- lots of uses for this whitening property, including:
  1. testing for variance changes
  2. bootstrapping time series statistics
  3. estimating δ for stationary/nonstationary fractionally differenced processes with trend
Estimation for FD Processes: I

- extension of work by Wornell; McCoy & Walden
- problem: estimate δ from a time series U_t such that
  U_t = T_t + X_t,
  where
  - T_t ≡ Σ_{j=0}^{r} a_j t^j is a polynomial trend
  - X_t is an FD process, but can have δ ≥ 1/2
- the DWT wavelet filter of width L has an embedded differencing operation of order L/2
  - if L/2 ≥ r + 1, it reduces the polynomial trend to 0
- can partition the DWT coefficients as
  W = W_s + W_b + W_w,
  where
  - W_s has the scaling coefficients and 0s elsewhere
  - W_b has the boundary-dependent wavelet coefficients
  - W_w has the boundary-independent wavelet coefficients
Estimation for FD Processes: II

- since U = 𝒲^T W, can write
  U = 𝒲^T (W_s + W_b) + 𝒲^T W_w ≡ T̂ + X̂
- Fig. 18: example with fractional frequency deviates
- can use the values in W_w to form a likelihood:
  L(δ, σ²_ε) ≡ Π_{j=1}^{J_0} Π_t e^{−W²_{j,t}/(2σ²_j)} / (2πσ²_j)^{1/2}
  (the inner product is over the boundary-independent W_{j,t}), where
  σ²_j ≡ σ²_ε ∫_{−1/2}^{1/2} ℋ_j(f) / |2 sin(πf)|^{2δ} df
  and ℋ_j(f) is the squared gain for h_{j,l}
- leads to a maximum likelihood estimator δ̂ for δ
- works well in Monte Carlo simulations
- get δ̂ = 0.39 ± 0.03 for the fractional frequency deviates
DWT-based Signal Extraction: I

- DWT analysis of X yields W = 𝒲X
- DWT synthesis X = 𝒲^T W yields the multiresolution analysis (MRA)
- estimator of a signal D hidden in X:
  - modify W to get W^{(t)}
  - use W^{(t)} to form the signal estimate: D̂^{(t)} ≡ 𝒲^T W^{(t)}
- key ideas behind wavelet-based signal estimation
  - the DWT can isolate the signal in a small number of W_n's
  - can threshold or shrink the W_n's
- these key ideas lead to "waveshrink" (Donoho and Johnstone, 1995)
DWT-based Signal Extraction: II

- thresholding schemes involve
  1. computing W ≡ 𝒲X
  2. defining W^{(t)} as the vector with nth element
     W^{(t)}_n = { 0, if |W_n| ≤ δ;  some nonzero value, otherwise },
     where the nonzero values are yet to be defined
  3. estimating D via D̂^{(t)} ≡ 𝒲^T W^{(t)}
- the simplest scheme is hard thresholding:
  W^{(ht)}_n = { 0, if |W_n| ≤ δ;  W_n, otherwise }
  (Fig. 19: solid line; a "kill/keep" strategy)
- an alternative scheme is soft thresholding:
  W^{(st)}_n = sign{W_n} (|W_n| − δ)_+,
  where sign{W_n} ≡ { +1, if W_n > 0;  0, if W_n = 0;  −1, if W_n < 0 }
  and (x)_+ ≡ { x, if x ≥ 0;  0, if x < 0 }
  (Fig. 19: dashed line)
DWT-based Signal Extraction: III

- a third scheme is mid thresholding:
  W^{(mt)}_n = sign{W_n} (|W_n| − δ)_{++},
  where (|W_n| − δ)_{++} ≡ { 2(|W_n| − δ)_+, if |W_n| < 2δ;  |W_n|, otherwise }
  (Fig. 19: dotted line)
- Q: how should δ be set?
- A: universal threshold (Donoho & Johnstone, 1995) (lots of other answers have been proposed)
- specialize to the model X = D + ε, where ε is Gaussian white noise with variance σ²_ε
- universal threshold: δ_U ≡ [2σ²_ε log(N)]^{1/2}
- rationale for δ_U: suppose D = 0 & hence W is white noise also
  - as N → ∞, have P[ max_n |W_n| ≤ δ_U ] → 1,
  - so all W^{(ht)}_n = 0 with high probability
  - will estimate the correct D with high probability
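- a sketch (mine; the toy coefficients and function names are arbitrary) of the three thresholding rules together with the universal threshold:

```python
import numpy as np

def hard(W, d):
    return np.where(np.abs(W) <= d, 0.0, W)                      # kill/keep

def soft(W, d):
    return np.sign(W) * np.maximum(np.abs(W) - d, 0.0)           # shrink toward 0

def mid(W, d):
    shrunk = 2.0 * np.maximum(np.abs(W) - d, 0.0)                # (|W|-d)_{++} when |W| < 2d
    return np.sign(W) * np.where(np.abs(W) < 2 * d, shrunk, np.abs(W))

N, sigma_eps = 1024, 1.0
d_univ = np.sqrt(2 * sigma_eps ** 2 * np.log(N))                 # universal threshold, ~3.72
W = np.array([-6.0, -3.2, -0.4, 0.1, 1.0, 4.5])                  # toy DWT coefficients
print(d_univ)
print(hard(W, d_univ), soft(W, d_univ), mid(W, d_univ), sep="\n")
```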
DWT-based Signal Extraction: IV

- can estimate σ_ε using the median absolute deviation (MAD):
  σ̂_{(mad)} ≡ median{ |W_{1,0}|, |W_{1,1}|, ..., |W_{1,N/2−1}| } / 0.6745,
  where the W_{1,t}'s are the elements of W_1
- Fig. 20: application to an NMR series
- has a potential application in "dejamming" GPS signals (with the roles of signal and noise swapped!)
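- a short sketch of the MAD estimator (0.6745 is the 0.75 quantile of the standard normal, so the estimate is calibrated for Gaussian noise); the stand-in coefficients are simulated here:

```python
import numpy as np

W1 = np.random.randn(512)                    # stand-in for the unit-scale DWT coefficients
sigma_mad = np.median(np.abs(W1)) / 0.6745   # MAD-based estimate of sigma_eps
print(sigma_mad)                             # ~1 for unit-variance Gaussian noise
```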
Web Material and Books

- Wavelet Digest: http://www.wavelet.org/
- MathSoft's wavelet resource page: http://www.mathsoft.com/wavelets.html
- books:
  - R. Carmona, W. L. Hwang & B. Torrésani (1998), Practical Time-Frequency Analysis, Academic Press
  - S. G. Mallat (1999), A Wavelet Tour of Signal Processing (Second Edition), Academic Press
  - R. T. Ogden (1997), Essential Wavelets for Statistical Applications and Data Analysis, Birkhäuser
  - D. B. Percival & A. T. Walden (2000), Wavelet Methods for Time Series Analysis, Cambridge University Press (will appear in July/August); http://www.staff.washington.edu/dbp/wmtsa.html
  - B. Vidakovic (1999), Statistical Modeling by Wavelets, John Wiley & Sons
Software

- Matlab
  - Wavelab (free): http://www-stat.stanford.edu/~wavelab
  - WaveBox (commercial): http://www.toolsmiths.com/
- Mathcad
  - Wavelets Extension Pack (commercial): http://www.mathsoft.com/mathcad/ebooks/wavelets.asp
- S-Plus
  - WaveThresh (free): http://lib.stat.cmu.edu/s/wavethresh
  - S+Wavelets (commercial): http://www.mathsoft.com/splsprod/wavelets.html