Homogenization of long-term monthly Spanish temperature data


INTERNATIONAL JOURNAL OF CLIMATOLOGY
Published online in Wiley InterScience

M. Staudt,* M. J. Esteban-Parra and Y. Castro-Díez
Departamento de Física Aplicada, Universidad de Granada, Granada, Spain

Abstract: Reliable time series are the basic ingredient when analysing climatic changes. However, the errors in real data are frequently of the same order as the signal being sought. Therefore, the available long-term monthly series of Spanish minimum and maximum temperatures have been compiled from the late 19th century on, in order to build a high-quality data set. The series are organized into climatically homogeneous regional groups and, in each group, the detection and adjustment are based on relative homogeneity and an analysis of the stationarity of the whole set of temperature-difference series. These series are scanned with moving t, Alexandersson, and Mann-Kendall tests. The detected inhomogeneities are adjusted by weighted averages of the regional series. The method is iterative and advances in steps of detection, adjustment, and updating. Individual inhomogeneous data are discarded and gaps are filled by similar weighted multiple means. For the analysis of the temperature evolution in the Iberian Peninsula, each region is finally represented by one local series and the regional average. The urban effect on minimum temperatures is adjusted by an empirical method, and for Madrid also by a correction derived from new homogenized data. Generally, rigorous homogeneity cannot be achieved because the initial data quality is deficient in many cases and metadata are sparse. Nevertheless, the data homogeneity and quality have been considerably enhanced: the total error margin in a series is of the order of 0.3 °C to 0.4 °C, assuming a worst-case error accumulation.
On the other hand, the number of inhomogeneities is considerable and their average amplitude is of the order of 1 °C, reflecting the much larger error margin in the raw data. The homogenized dataset constitutes an important basis for the subsequent detection of thermal changes in Spain in the last 130 years, at a clearly higher confidence level than before.

KEY WORDS: temperatures; data homogeneity; statistical tests; climate change; Spain; Iberian Peninsula

Received 14 July 2005; Accepted 9 December 2006

* Correspondence to: M. Staudt, Departamento de Física Aplicada, Facultad de Ciencias, Campus de Fuentenueva, Universidad de Granada, Granada, Spain. E-mail: mstaudt@ugr.es

INTRODUCTION

Reliable data are a necessary basis for a study of the evolution of a climatic variable and the detection of changes. In many countries, systematic instrumental weather observations began in the 19th century and since then, the availability of quantitative data has considerably improved. A time series of a climatic variable is called homogeneous when its variations have a climatic origin only (Mitchell et al., 1966). Unfortunately, a vast majority of all climate records are adversely affected by nonclimatic changes in the data. A relocation of an observatory, replacement of instruments, variations in the environment or in reading procedures, as well as human errors in data processing are rather frequent. Under these circumstances, a series suffers artificial biases, most frequently sudden jumps or breaks, and may fail to represent the real climatic evolution. A reliable detection of climate change is hard or impossible when the error related to data quality is of the same order of magnitude as the signal being sought.

The large extent of the data quality problems is well known in the recent literature of climate research. In Chapter 12, the third assessment report of the IPCC (IPCC, 2001) states that "The quality of observed data is a vital factor. Homogeneous data series are required with careful adjustments to account for changes in observing system technologies and observing practices." Moreover, Peterson et al. (1998b) point out that "Unfortunately, most long-term climatological time series have been affected by a number of non-climatic factors that make these data unrepresentative of the actual climate variation occurring over time." Trenberth (2002) notes that "we do not have an adequate climate observing system" and "There must be an active program of research and analysis utilizing climate data sets to ensure the data are state-of-the-art and meet requirements." Besides observational programmes for improving future data quality, a strong effort must undoubtedly also be dedicated to homogenization and quality control of the existing data.

An early introductory study on homogeneity and statistical tests was given in Mitchell et al. (1966). They described the problem of achieving absolute homogeneity. Among the more recent efforts in data homogenization, Goossens and Berger (1986) applied different statistical methods, such as the Mann-Kendall test, to the detection of changes in climatic series. Alexandersson (1986) developed the standard normal homogeneity test (SNHT), applied to the Swedish precipitation series in subsequent work (Alexandersson and Moberg, 1997; Moberg and Alexandersson, 1997). The SNHT is one of the most efficient tests for homogeneity, as Ducré-Robitaille et al. (2003) recently demonstrated.

Several homogenization methods have been created and first applied to North American data. Karl and Williams (1987) developed a method that explicitly considers the metadata and detects and adjusts data changes statistically, using adjacent series. This method has been applied to a large number of North American temperature and precipitation series. Young (1993) and Rhoades and Salinger (1993) presented alternative methods, also based on similar data from highly correlated series. Peterson and Easterling (1994) and Easterling and Peterson (1995) developed a different strategy, with reference series and a Monte Carlo method, and they adjusted the data by least-squares linear regressions. The method of Vincent (1998) works with multiple regressions and is applied to daily Canadian temperature series in Vincent and Gullett (1999) and Vincent et al. (2002).

In recent studies, attempts have been made to homogenize European data. Slonosky et al. (1999) have created a method with multiple comparisons and adjustments between adjacent series, but without reference series, and have applied it to long-term European pressure series.
Their results prove to be similar to those of analytically more sophisticated methods, such as the statistical technique by Mestre (1999). González-Rouco et al. (2001) have homogenized the south-western European precipitation series with an iterative method by extending the strategy of Hanssen-Bauer and Forland (1994). Stepanek (2003) has recently created the software AnClim, especially for the practical application of virtually all relevant homogenization methods for climate data. In a recent international effort on data quality, Wijngaard et al. (2003) analyzed the daily temperature and precipitation data of the European Climate Assessment (ECA). They found that a vast majority of the series suffer clear homogeneity problems.

Nevertheless, among the applications of the homogenization methods in the literature, there is still a lack of systematic treatment of Spanish temperature data. The present study carefully prepares these data series, seeking to achieve maximum data quality. The aim is to set a solid base for a reliable subsequent analysis of thermal changes and their confidence levels on a regional scale since the late 19th century.

DATA

The Spanish temperature data used in this study have been provided by the National Meteorology Institute (INM). The recording of monthly temperatures began sometime between 1869 and 1880 in about 20 observatories, mainly in province capitals (older records are rare), although at some sites the observations were not recorded until the first or second decade of the 20th century. Data quality is problematic or even poor in many cases, because of frequent site changes and data gaps, and metadata are scarce. Figure 1 gives a schematic overview of the temporal data coverage until 1980 and shows the geographic distribution of the observatories.

Definition of the regional groups of data series

The Spanish monthly temperature series contain a high degree of common variability.
The cross-correlations between the anomalies usually exceed 0.5, even at distances of the order of 500 km. Nonetheless, the temperature-anomaly patterns show regional distinctions, as found for the winter maxima by Frías Domínguez et al. (2002). The prior compilation of the data series into climatic groups derives from these regional differences. The basic threefold distinction separates the peninsular mode of thermal evolution, which geographically includes the central plains and the major part of the south, from the Mediterranean (eastern) and the Cantabrian (northern) coastal areas. Furthermore, Galicia, western Andalusia, Extremadura and the Ebro valley are also treated as climatically different groups, in order not to eliminate possible regionally distinctive details of the temperature evolution. A preliminary analysis of the series from the high plains and the Mediterranean did not detect significant differences between the temperature evolutions in their northern and southern regions. The cross-correlations between the anomalies in each regional group systematically exceed 0.6 and clearly confirm the high level of regional synchronicity of the variations, an essential ingredient for the homogenization method.

According to these results, the regional groups that will be homogenized separately, without mixing information between them by adjustments, are (the number of series in each group is given in parentheses): Galicia (6), Cantabria (5), Ebro valley (4), Mediterranean (6), central high plains (14), western Andalusia (4) and Extremadura (2). In each of these climatic regions, all the series are homogenized and then the regional mean series (simple mean of the anomalies) is computed a posteriori. From the homogeneity viewpoint, each individual series could represent its region, but the regional mean series is particularly valuable for subsequent analysis.
Hence, each region is represented by one local series and the regional mean, in order to analyze the recurrence of the results, in the sense of coherence between the two representative series: a variability feature is considered highly authentic if it appears in both series.

Figure 1. Scheme of the temporal and spatial coverage of the Spanish maximum and minimum temperature series, between 1860 and 1980 (in more recent years, the coverage is complete, with very few exceptions). The series are: 1. La Coruña, 2. Santiago, 3. Pontevedra, 4. Orense, 5. Vigo, 6. Finisterre, 7. San Sebastián, 8. Bilbao, 9. Santander, 10. Vitoria, 11. Pamplona, 12. Oviedo, 13. Zaragoza, 14. Huesca, 15. Logroño, 16. Teruel, 17. Lérida, 18. Gerona, 19. Barcelona, 20. Castellón, 21. Valencia, 22. Alicante, 23. Murcia, 24. Almería, 25. Burgos, 26. Valladolid, 27. Salamanca, 28. Soria, 29. León, 30. Palencia, 31. Zamora, 32. Ávila, 33. Segovia, 34. Madrid, 35. Guadalajara, 36. Toledo, 37. Cuenca, 38. Albacete, 39. Ciudad Real, 40. Córdoba, 41. Seville, 42. Huelva, 43. Jerez, 44. Málaga, 45. Granada, 46. Jaén, 47. Badajoz, 48. Cáceres. The regional groups are: A) Galicia, B) Cantabria, C) Ebro valley, D) Mediterranean, E) central plains, F) western Andalusia and G) Extremadura.

Discarded data due to homogeneity problems

Data or intervals were rejected when the homogeneity problems were too strong to permit an adjustment at an acceptable confidence level. This happens under the following circumstances:

Individual data or intervals are discarded if their difference with at least two (or three) of the other series of the region is extreme at the 95% confidence level (in an appropriate time interval around these data).

Disconnected short intervals (shorter than a decade) with many interruptions are also discarded, as well as intervals where more than approximately one-third of the data are missing. Apart from all the available difference series, the anomalies of the candidate series are always thoroughly cross-checked.

An interval has to be discarded when the available data in a given region do not permit an adjustment of an inhomogeneous break at a satisfactory confidence level (when no other or only one more series is available).

A whole series is generally discarded when more than five discontinuous breaks (or other clear inhomogeneities to be adjusted) are found. This decision depends also on the length and overall quality of the series (one long series is maintained with six adjusted breaks; see Table I).

Unfortunately, the following long intervals or entire series had to be rejected:

In eastern Andalusia, the available long-term series from Jaén and Granada were discarded because their temporal data coverage was unsatisfactory.

Table I. Total numbers of data, adjustments (adj.) and rejected (rej.) data (individual data or intervals) in the maximum and minimum temperature series. Columns: Series; then Nr. data, Nr. adj. and Nr. rej. data for the maximum temperatures, followed by the same three columns for the minimum temperatures. Series listed: La Coruña, Santiago, Pontevedra, Orense, Vigo, Finisterre, San Sebastián, Bilbao, Santander, Vitoria, Pamplona, Zaragoza, Huesca, Logroño, Lérida, Valencia, Gerona, Barcelona, Castellón, Alicante, Murcia, Madrid, Ávila, Burgos, León, Palencia, Salamanca, Segovia, Soria, Zamora, Guadalajara, Toledo, Albacete, Ciudad Real, Cuenca, Seville, Córdoba, Huelva, Jerez, Málaga, Badajoz, Cáceres. [Numerical entries missing in source.]

The data in Galicia before 1880 and the minima in the Mediterranean before 1893 were also rejected because of severe homogeneity problems and/or lack of data.

The 19th century data in western Andalusia and Extremadura could not be connected with sufficient confidence to the 20th century data and therefore were not considered. The lost information was partially recovered by defining an average series that consisted of western Andalusia, Extremadura and Málaga, where these data were connected and used.
Some individual series or long intervals were rejected because of severe homogeneity problems: the maxima and minima series in Valladolid, the maximum records in Alicante and the minima in Ciudad Real, as well as the minima in Orense until 1949, the maxima in León until 1937, the minima in Guadalajara until 1970 and in Cuenca until [year missing in source].

METHODOLOGY

Statistical properties of the monthly temperature data

Temperature records show little variation on spatial scales of hundreds of kilometres in regions with a regular orography, such as the central plains of the Iberian Peninsula and along the coasts, where the cross-correlations between the monthly series generally exceed 0.7. Nonetheless, regional differences may be crucial in studies of the temperature evolution and its significance levels. The dataset of the present study is developed not only for a high-confidence analysis of the general trends, but also of the interregional differences in Spanish temperatures.

Monthly temperature series show a distinct lack of stationarity, because of frequent trends at time scales of months or several years, which are highly significant in many cases. Schönwiese and Rapp (1997) point out that "... short-term trends ... become enormously unstable in all seasons, even changing their sign." This variability characteristic complicates the detection of inhomogeneities and requires high significance levels, to avoid an erroneous detection and attribution of inhomogeneities. These stationarity properties do not differ significantly among the treated regions and therefore, the same statistical criteria are applied everywhere.

The statistical distribution of the temperature data is normal to a good approximation and there is no problem in applying parametric statistics designed for Gaussian-distributed variables, such as the t-test or the SNHT.
The autocorrelations (serial correlations) in these series are rather slight (coefficients between 0.1 and 0.3), but several statistical tests require corrections (the reduced sample size for the t-test and prewhitening of the series for the Mann-Kendall test) in order to achieve realistic confidence levels.

The basic homogenization concept

The criterion of absolute homogeneity is fulfilled if a climatic series does not include any variability, except for the real climatic evolution. However, this condition is almost never fulfilled, because of the problems in real data. Easterling et al. (1996) pointed out that "... the real homogeneity of climatic data is irretrievably lost." From the analysis of an individual series, it is generally impossible to decide at a high confidence level whether or not a certain change is inhomogeneous, and the absolute-homogeneity criterion is therefore not applied in the present study.

The concept of relative homogeneity developed here is based not on individual series, but on their differences, because the anomalies of highly correlated time series are essentially synchronous. Hence, a local inhomogeneity can be detected in the difference series, whereas an authentic extreme anomaly tends to vanish there. This detection method fails if several series suffer a simultaneous data problem (e.g. a common sudden jump). Comparing as many difference series as possible minimizes this risk.

The following relative homogenization method is based on multiple comparisons between the climatically similar series within each predefined climatic region. No reference series is defined because the frequent inhomogeneities and missing data do not permit a reliable a priori reference. The whole set of difference series (differences of anomalies) is statistically tested for significant changes (see 'The scheme of the homogenization method'). Once identified, an inhomogeneous change is adjusted by a weighted mean of the highest-correlated series. The weighting factors depend on the synchronicity (cross-correlation) and the number of common data of each surrounding series of the same region, relative to the candidate. For an abrupt change, the after-minus-before difference is replaced by this weighted average (see 'The adjustment algorithm'). The series are adjusted separately in each region, to avoid merging information.
This is essential to prepare the dataset for a subsequent detection of regional differences.

The scheme of the homogenization method

1. The raw-data series are converted into anomalies, relative to the monthly mean of a given reference period (the final reference is [years missing in source]). The whole set of anomaly difference series is computed within each region (these are more efficient than absolute differences, because in the latter, stronger residuals of the annual cycle remain). Following the idea of multiple comparisons, in a region with $n$ series, $\sum_{i=1}^{n-1} i = n(n-1)/2$ difference series are simultaneously analyzed, in order to detect (and then to adjust) the significant inhomogeneities in all series.

2. The suspicious inhomogeneities are marked (mostly abrupt changes or breaks, but also individual extreme data), with particular attention to the metadata information.

3. The largest and most obvious extreme values (outliers) are identified and discarded when the anomalies exceed a certain level (four standard deviations of a running 30-year interval, centred at each data point, although sometimes data coverage restricts the length of the detection interval). This search is based on the difference series, to avoid the rejection of authentic large anomalies. In this step, the criterion is deliberately strict, so that some inhomogeneous data are still preserved: it removes only the very large inhomogeneous outliers, prior to the closer analysis.

4. The set of difference series is recalculated and the possible abrupt inhomogeneities (breaks) are searched for and classified. Then, for each feature suspected to be inhomogeneous, an appropriate base interval is individually defined for statistical detection and verification. The length of these base intervals is generally [number missing in source] years, placed symmetrically around the possible break point (if possible), and must strictly avoid temporal overlapping with other inhomogeneities, which would produce skewed results. Besides a reasonable sample size (at least of the order of 100), the so-called station drift must be considered: the differences between highly correlated temperature series are often not stationary, but show frequent trends of changing signs (even in the absence of site changes or other inhomogeneities; Rhoades and Neill, 1995). Therefore, the base intervals of the candidate series must be shorter if a stronger drift (less stationarity) is present, because earlier or later data are then less valid for the adjustment at a certain time.

5. The statistical tests are applied to the whole set of difference series in the base intervals defined in the previous step. Moving t and SNHT (Alexandersson) tests scan the intervals, to determine the probability of a break as a function of its time (see 'The statistical detection of discontinuous inhomogeneities'). Special attention is given to the metadata, by examining first the time intervals around the incidents reported in the literature. However, the metadata are scarce: the method considers them but does not depend on them. The general detection criterion for an inhomogeneous break is a level significantly higher than 99% in the t-test, and at least 50% above the 95% level in the SNHT, recurrent in three difference series with highly correlated data. The local anomalies are checked in order to avoid wrong conclusions and, in doubtful cases, the results are subjected to the sequential Mann-Kendall test.

6. Once an inhomogeneous break is detected, the adjustment works with a weighted average of the highest-correlated simultaneous regional data (up to five series). The candidate's after-minus-before difference is replaced by a weighted average of the analogous differences of the correction series. In very few particular and highly significant cases, continuous inhomogeneous features are detected and adjusted by a similar procedure. The method is similar because the detection is performed with the same statistical tests and the adjustment consists of a linear trend that is obtained as the weighted mean of the slopes of the highly correlated nearby series.

To ensure the essential noninterference between the different adjustments, steps 4-6 are executed iteratively, although jointly for all series: after adjusting all the disjoint inhomogeneities in all the series of a region in the first iteration, the set of difference series is recalculated before applying the tests again in the second iteration. This iterative method is necessary because the correction intervals frequently overlap each other and, in this sense, not all breaks and their corrections are independent or disjoint from each other. Furthermore, in some cases, slighter inhomogeneities could not be detected until a large inhomogeneity was adjusted and the detection was repeated in the next iteration step (with all data updated). The iteration stops when no more significant inhomogeneities are detected after updating the series.

7. A search is made for the individual inhomogeneous data by detecting extreme values (as explained in part F) of the difference series and checking these data points in each local series. The detected inhomogeneous data are removed.

8. The missing data are filled by weighted means of the best-correlated synchronous data (originally missing data or gaps created by removed inhomogeneous data). The filling algorithm works with up to five regional series and assumes synchronicity between these series (see 'The replacement of missing data').

9. Finally, the dataset is prepared with two time series for each climatic region, expressed as anomalies, relative to the reference period [years missing in source]: one local series and the regional average (all the local series are also available, for further purposes).
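Step 1 of the scheme can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the use of `None` for missing months, the assumption that every series starts in January, and the index-based reference period are all choices made here for illustration only.

```python
from itertools import combinations
from statistics import mean


def monthly_anomalies(values, ref_start, ref_end):
    """Anomalies relative to the monthly means of values[ref_start:ref_end].

    Assumptions (not from the paper): `values` is a list of monthly
    temperatures starting in January, with None marking missing months.
    """
    climatology = []
    for month in range(12):
        ref = [v for i, v in enumerate(values[ref_start:ref_end], start=ref_start)
               if i % 12 == month and v is not None]
        climatology.append(mean(ref) if ref else None)
    return [None if v is None or climatology[i % 12] is None
            else v - climatology[i % 12]
            for i, v in enumerate(values)]


def difference_series(anomalies):
    """All pairwise anomaly differences within one region.

    `anomalies` maps a (hypothetical) station name to its anomaly list.
    For n series this yields n(n-1)/2 difference series.
    """
    diffs = {}
    for a, b in combinations(sorted(anomalies), 2):
        diffs[(a, b)] = [None if x is None or y is None else x - y
                         for x, y in zip(anomalies[a], anomalies[b])]
    return diffs
```

For the central plains group with 14 series, `difference_series` would return 91 difference series, which is the set scanned by the tests in steps 4 and 5.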
The statistical detection of discontinuous inhomogeneities

A break is detected when the corresponding significance exceeds the 99% level in the t-test and exceeds the 95% level by 50% in Alexandersson's test, in a recurrent way in at least three cases (three difference series of the same candidate).

The windowed t-test. This well-known statistical test measures the significance level of a change in the mean, is parametric and assumes normality and serial independence. It is robust against slight deviations from normality if the sample is large enough ($n > 20$), but significant autocorrelations cause skewed results. The test overestimates the significance when these are positive (common in temperature records). With a first-order autocorrelation coefficient $\rho_1$, the reduced sample size correction replaces the sample size $n$ by $n' = n(1 - \rho_1)/(1 + \rho_1)$. This correction is valid because the memory of monthly temperature records is rather short and their autocorrelations are essentially of first order. Preliminary experiments show that this correction reduces the statistical confidence typically by 20-30%.

The SNHT (Alexandersson's standard normal homogeneity test). This test by Alexandersson (1986) (initially applied to the Swedish precipitation series) is now frequently used in climatology. It detects a single abrupt change (break) in the mean value of a Gaussian time series, assuming two stationary subseries, before and after the (possible) break, against the null hypothesis of one stationary series. The 95% confidence level for a break is 9.15 for a sample size of 100, and rises slightly to 10 for 400 data and to 10.5 for 800 data. As mentioned, the present study requires higher significance levels: the detection is considered highly significant if the coefficient exceeds the 95% level by 50% (value ≈ 15).

An example of the detection of an inhomogeneous break is given in Figure 2.
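The two statistics can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. The SNHT statistic uses the standard form $T(k) = k\,\bar{z}_1^2 + (n-k)\,\bar{z}_2^2$ on the standardized series. Several details are assumptions made here: the lag-1 autocorrelation is estimated from the residuals about the two segment means, it is clamped to the range [0, 0.99], the effective sample size is applied only to the standard error of a pooled-variance t statistic, and the standard deviation uses the n-1 denominator.

```python
from math import sqrt
from statistics import mean, stdev


def lag1_autocorr(x):
    """First-order autocorrelation coefficient of a list of numbers."""
    m = mean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den if den else 0.0


def t_break(before, after):
    """t statistic for a change in mean with reduced sample sizes
    n' = n (1 - rho1) / (1 + rho1). How rho1 is estimated and applied
    is an assumption of this sketch."""
    m1, m2 = mean(before), mean(after)
    resid = [v - m1 for v in before] + [v - m2 for v in after]
    rho = min(max(lag1_autocorr(resid), 0.0), 0.99)  # safeguard, assumed
    factor = (1.0 - rho) / (1.0 + rho)
    n1_eff, n2_eff = len(before) * factor, len(after) * factor
    pooled_var = sum(r * r for r in resid) / (len(resid) - 2)
    return (m2 - m1) / sqrt(pooled_var * (1.0 / n1_eff + 1.0 / n2_eff))


def snht(x):
    """Return (k, T_max): T(k) = k*mean(z[:k])**2 + (n-k)*mean(z[k:])**2,
    maximized over k, where z is the standardized series."""
    n = len(x)
    m, s = mean(x), stdev(x)
    z = [(v - m) / s for v in x]
    total, csum = sum(z), 0.0
    best_k, best_t = None, -1.0
    for k in range(1, n):
        csum += z[k - 1]
        z1 = csum / k
        z2 = (total - csum) / (n - k)
        t = k * z1 * z1 + (n - k) * z2 * z2
        if t > best_t:
            best_k, best_t = k, t
    return best_k, best_t
```

Under the criteria quoted in the text, a candidate break in a difference series would be retained when `t_break` is significant well beyond the 99% level and `snht` returns a value of about 15 or more, in at least three difference series of the same candidate.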
The running t-test and SNHT show a similar behaviour and confirm a highly significant break at the beginning of the 1980s (the 95% levels are ≈2 for the t-test and ≈10 for the SNHT). The SNHT has a sharper peak, due to its quadratic algorithm. The t-test in the 20-year running window gives lower significance levels than in the 40-year interval, because of the smaller sample size.

Figure 2. (A) A difference series of maximum temperatures; (B) the coefficients of the t-test with a 20-year running window (discontinuous line) and in the whole 40-year interval (both left axis) and of the SNHT in the 40-year interval (thick line, right axis).

The adjustment algorithm

After an inhomogeneity (break) in the candidate series and its time are detected, the adjustment works as follows:

Case 1. When there is a sufficiently long overlapping interval (at least 3 years) of the subseries $x_t^1$ and $x_t^2$ around the break point, after verifying the synchronicity of the evolution in both subseries and the absence of clearly inhomogeneous features, the adjustment is made as the mean difference $\Delta = \frac{1}{k}\sum_{t=1}^{k}\left(x_t^1 - x_t^2\right)$ in the overlapping period ($t = 1, \dots, k$).

Case 2. When a series undergoes a break for nonclimatic reasons, usually there are no overlapping data and the adjustments are based on multiple differences between the candidate series and the highly correlated series of the same region. The cross-correlations $\rho_j$ between the candidate series $x_t$ and the available series $x_t^j$ are computed for these intervals, with corrections for the autocorrelations, if necessary (use of the whitened residuals, after separating an ARIMA process from the series). Up to $k = 5$ series are chosen for correction under the criterion of highest cross-correlations. Given an adjustment interval of $m$ months, with data $x_t$ and $x_t^j$ at each side of the break at $t = \tau$:

$\{x_t, x_t^j;\ t = \tau - m + 1, \dots, \tau;\ j = 1, \dots, k\}$ and $\{x_t, x_t^j;\ t = \tau + 1, \dots, \tau + m;\ j = 1, \dots, k\}$,

the mean after-minus-before difference is computed for each $j = 1, \dots, k$. With the indices bef = before and aft = after the break, the partial adjustment $\Delta_j$, given by one neighbouring series $j$, is

\[ \Delta_j = \left(\bar{x}^j_{\mathrm{aft}} - \bar{x}_{\mathrm{aft}}\right) - \left(\bar{x}^j_{\mathrm{bef}} - \bar{x}_{\mathrm{bef}}\right) = \frac{1}{m}\left[\sum_{t=\tau+1}^{\tau+m}\left(x_t^j - x_t\right) - \sum_{t=\tau-m+1}^{\tau}\left(x_t^j - x_t\right)\right] \tag{1} \]

The total adjustment term $\Delta$ is a linear superposition of the $k$ individual offsets $\Delta_j$, with the squared cross-correlations $\rho_j^2$ and the coefficients $q_j$ (common data fraction) as weight factors:

\[ \Delta = \left(\sum_{j=1}^{k} \rho_j^2\, q_j\, \Delta_j\right) \Bigg/ \left(\sum_{j=1}^{k} \rho_j^2\, q_j\right) \tag{2} \]

This adjustment is applied to the data before the break, because leaving the recent data unchanged is a practical advantage for later updates. The adjustments of the breaks are always based on the whole monthly dataset, but generally, the adjusted value does not depend on the month or the season. Only a few seasonally distinct adjustments are applied, when the seasonal discrepancies are particularly large.

In the literature, different types of adjustments can be found (see for example Peterson et al., 1998a). Inhomogeneities in climate data often depend on the month or season, because of the seasonally diverging impacts of instrumental or environmental changes. Hence, an adjustment that depends on the month of the year can theoretically be better. However, it modifies the variability, the autocorrelation structure, and the annual cycles of the data, whereas the adjustments generally performed here consist of a simple additive term. Furthermore, a monthly varying adjustment must work with 12 times fewer data (for a given interval length) and the confidence margins are substantially wider. Hence, adjustments of this type become more attractive when the initial data quality is higher than in the present study.
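Equations (1) and (2) of Case 2 can be written as a short Python sketch. The function names and the zero-based index convention (`tau` is the index of the first month after the break, and `tau - m >= 0` is required) are assumptions made here. The text states only that the adjustment is applied to the data before the break; the sign used in `apply_offset` (subtracting the offset from the earlier data, which brings them to the later level) is likewise an assumption of this sketch.

```python
from statistics import mean


def partial_offset(cand, neigh, tau, m):
    """Eq. (1): mean of (neighbour - candidate) over the m months after the
    break minus the same mean over the m months before it. Months where
    either value is None (missing) are skipped."""
    def mean_diff(lo, hi):
        return mean(n - c for c, n in zip(cand[lo:hi], neigh[lo:hi])
                    if c is not None and n is not None)
    return mean_diff(tau, tau + m) - mean_diff(tau - m, tau)


def total_adjustment(offsets, rhos, qs):
    """Eq. (2): average of the partial offsets, weighted by rho_j**2 * q_j."""
    weights = [r * r * q for r, q in zip(rhos, qs)]
    return sum(w * d for w, d in zip(weights, offsets)) / sum(weights)


def apply_offset(cand, tau, delta):
    """Shift the pre-break data by -delta (assumed sign convention)."""
    head = [None if v is None else v - delta for v in cand[:tau]]
    return head + list(cand[tau:])
```

As a check of the sign convention: if the candidate jumps by +1 at the break while its neighbours stay flat, every partial offset is -1, and `apply_offset` raises the earlier data by 1 so that the series becomes continuous at the recent level.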
The detection of individual extreme anomalies The extreme anomalies are detected relative to a symmetrically running 30-year interval centerd at each data point. The detection does not work relative to a fixed reference interval, but with a moving window, to determine extreme events relative to the mean temperatures and variability of their adjacent period. All local extreme anomalies are catalogued, but, as in the preceding steps, their differences are crucial for the homogeneity analysis. An anomaly is generally (with few exceptions) considered inhomogeneous when its amplitude exceeds the 2.81 σ - level (99.5% confidence) in at least three difference series between the candidate and the surrounding series. Once the inhomogeneous data is deleted, the gaps can be filled in, as described in the following section. The replacement of missing data Data gaps are frequent in almost all climate data and are an obstacle for an analysis that requires complete series. However, to reject all incomplete series would mean the discarding of almost all data, and therefore a filling strategy for the gaps is necessary. Missing at random (MAR), Little and Rubin, 1987) is a basic condition for missing data (usually presupposed). It is fulfilled when the occurrence probability of a gap at a certain time is independent of the variable s value at this time. This is not the case when certain data are lacking because the values were extreme and could therefore not be measured. The first type of available information for filling gaps is the intrinsic information in the series. The ARIMAmethod (autoregressive integrated moving average, see for example Box and Jenkins, 1976) analyses stationarity and autocorrelations of a series and decomposes it into an ARIMA-series and white-noise residuals. A prediction can be drawn out of the ARIMA-subseries (the best predictor of a white noise is always zero). For monthly temperatures, this method has little predictive potential

because the white-noise component generally explains 70–90% of the variability. Hence, the ARIMA method is used to fill a data gap with the intrinsic information of the candidate series only when no simultaneous regional data are available.

The second information type, the synchronous data of the adjacent series, is more efficient in replacing missing data on account of the high cross-correlations between nearby series. The basic idea is to weight the contribution of each time series according to its confidence level. This is done by a weighted average of the anomalies of up to m = 5 related series. Several points have to be considered for a proper gap-filling algorithm:

- The most relevant information is again contained in the adjacent years, because the series memory is rather short. Hence, all involved anomalies are computed relative to a symmetric interval (usually 30 years) around the gap.
- The interpolations are based on standardized anomalies, because standard deviations may differ systematically, even between highly correlated temperature series.
- The higher the correlation with the candidate, the higher the weight given to the contribution of a series.
- The confidence in a contribution that is based on a reduced number of data in common with the candidate decreases, and so must its weighting factor.
- If the cross-correlations of the surrounding series with the candidate are weak, the amplitude of the correction is reduced. This means a more cautious gap filling (closer to zero), because in the case of complete ignorance the gap would be filled with an anomaly of zero.

Let $T_c$ and $T_j$ ($j = 1, \ldots, m$) be the temperature anomalies of the candidate and the correction series at a certain time, $\sigma_c$ and $\sigma_j$ their standard deviations and $c_j$ the weighting factors.
Then, the anomaly to fill in the gap of the candidate series is

$$T_c = \sum_{j=1}^{m} c_j \frac{\sigma_c}{\sigma_j} T_j, \quad \text{while} \quad \sum_{j=1}^{m} c_j = 1 \qquad (3)$$

The weights $c_j$ depend on the squared cross-correlations $\rho_j^2$ with the candidate and on a common-data parameter $q_j$, so that

$$c_j = q_j \rho_j^2 \Big/ \sum_{i=1}^{m} q_i \rho_i^2 \qquad (4)$$

Weak cross-correlations are taken into account by computing their sum of squares $S$. If this factor is smaller than unity, the temperature estimate in Equation (3) is multiplied by $S$. If $S$ is even smaller than 0.5, this method is discarded and the data gap is filled by an ARIMA interpolation (intrinsic information in the candidate series). After this step, the regional average is computed and the series are now almost ready for a comparative analysis in each region, with minimized homogeneity problems and without gaps. The last factor (given below) to be considered is the urban heat island.

THE ADJUSTMENT OF THE URBAN HEAT ISLAND

The urban heat island

The small-scale urban warming is a well-known phenomenon in climatology. Its principal causes are the heat-storage capacity of buildings and streets, the quick removal of rainwater, the heat emissions from houses, vehicles and industry, and sometimes a reduced infrared radiative heat loss, due to locally increased atmospheric turbidity. Several studies at different latitudes have found significant thermal differences between urban and rural observatories (Tereshchenko and Filonov, 2001; Figuerola and Mazzeo, 1998; Shahgedanova et al., 1997; Landsberg, 1981; Oke, 1973) and the state of knowledge about the urban heat island is described in Arnfield (2003). An adjustment of the urban effect is necessary to attain realistic results concerning thermal evolution and changes for large cities. The urban effect is usually greatest in the minimum temperatures of the early morning and under anticyclonic conditions (Montávez et al., 2000; Unger, 1996; Colacino and Lavagnini, 1982). Consequently, it also depends on the season: Yagüe et al.
(1991) found a strong urban effect in Madrid in summer and a weaker effect in spring. In the present study, the adjustment is made on a long-term basis, without a need to discriminate between seasons or weather types. Hence, the aim is to establish a quantitative relationship between the urban population (as a measure of city size) and urban-rural temperature differences.

The empirical urban adjustment

Unfortunately, the Spanish data coverage is not sufficient for a study of this relationship on a solid statistical basis, and thus an empirical result is used. After reviewing the literature (the aforementioned studies and, moreover, Kukla et al. (1986), Colacino and Rovelli (1983), Moreno García (1994), Portman (1993), Kozuchowski et al. (1994) and Karl et al. (1988)), we adopted from the latter study the relation

$$\Delta T_{urb-rur} = a \cdot \mathrm{popul}_{urb}^{0.45} \qquad (5)$$

where $\mathrm{popul}_{urb}$ represents the population of the city. This result is based on a large number of data series (more than 1200) and was recently confirmed by Englehart and Douglas (2003). The most consistent results were found with a coefficient $a = (2.39 \pm 0.70) \times 10^{-3}$ K (95% confidence) for minimum temperatures (Karl et al., 1988). For maximum temperatures, the results did not differ significantly from zero. Furthermore, an urban effect on the maxima was

neither theoretically well explained nor clearly confirmed in the above works, although Philandras et al. (1999) reported an urban effect in Athens that was stronger in the maximum than in the minimum temperatures. Hence, in the present study, this empirical urban correction is applied to the local representative series of each region, as a function of its population, but only for the minimum temperatures. The urban thermal effect is generally weaker in Europe than in North America, and the corresponding adjustment factor of 0.7 (Karl et al., 1988) is applied and discussed for Spain.

An alternative adjustment for the minimum temperatures in Madrid

For Madrid, an alternative correction for the minimum temperatures is constructed with the data of Madrid Retiro and Toledo. The latter observatory is far enough from Madrid to be outside the urban area, but near enough to have almost identical climatic conditions. Segovia and Ávila are discarded, because a mountain range separates these observatories from Madrid, while Guadalajara is rejected because of its limited data coverage (Table I). The differences of the minimum temperatures Madrid − Toledo show the urban influence, with a clear and highly significant increase between 1930 and 1970 (Figure 3), roughly synchronous with the urban growth of Madrid (the growth of Toledo is considered negligible). The linear regression

$$\Delta T_{urb-rur} = p + q \cdot \mathrm{pop}_{Madrid} \qquad (6)$$

links the differences in minimum temperature and the population (the parameters p and q are given in Table II). For the minimum temperatures in Madrid, both urban corrections can be compared (see the section "The empirical urban correction and the approach with the data of Madrid and Toledo"). The resulting coefficient q in Equation (6) defines an urban correction of approximately 0.35 °C for each million inhabitants, whereas the empirical adjustment of Equation (5), applied to Madrid, is larger: even with the reduction factor of 0.7, the correction is about 1.27 °C for the first million inhabitants, 0.46 °C for the second and 0.35 °C for the third million.

Figure 3. Population of Madrid (dotted line, right axis) and 8-year moving averages of the differences Madrid − Toledo in minimum temperatures (continuous line, left axis).

Table II. Row I: parameters of the linear regression (Equation (6)) for the differences Madrid − Toledo in minimum temperatures, as a function of the population of Madrid (in millions). Row II: like I, but for the 8-year moving averages; ε is the total error of the estimation in the period . The coefficient in row II has a smaller error than the one in row I, due to the compensation of individual anomalies, and has been chosen to compute the adjustment.

        p (°C)    q (°C/10^6 inhabitants)    ε (°C)
(I)     ±         ±
(II)    ±         ±

RESULTS

The compiled homogenized dataset: adjustments and rejected data

In this study, 43 monthly series of maximum and minimum temperatures, almost all available Spanish long-term series with coverage longer than 30 years, have been organized into seven regional groups and homogenized. The analysis of data quality confirmed widespread homogeneity problems. Adjustments were necessary in almost all series, although the criteria for the detection of inhomogeneities were severe (high significance and

redundancy levels). In some cases, long intervals (the maxima in León and minima in Guadalajara and Cuenca) or entire series (the maximum temperatures in Valladolid and the minimum temperatures in Ciudad Real) were rejected, because of a lack of homogeneity. On the whole, 59 (85) inhomogeneities were adjusted in the maximum (minimum) temperatures (Table II), with mean amplitudes of 1.00 °C (1.05 °C); in addition, there were many rejected intervals and individual data. On average, one adjustment was made for every 44.5 years (66 years) and a series of years required an approximate mean of two adjustments. The temperature evolution of each region was then represented by its average anomalies and one local series. This dataset maximized the confidence, because all participating series were carefully analysed and adjusted for homogeneity. Interregional differences ( km scale) in the temperature evolution were resolved, although they were of second order, compared to the common variability at the 1000 km scale. The sub-regional differences (<100 km) were of third order and impossible to resolve with these series, owing to the limited data quality and because the adjustments had intentionally mixed the data within one region.

The adjustments applied to the data series consisted mainly of corrections of breaks (abrupt changes). Moreover, the series were also scanned for individual inhomogeneous and extreme data, and the gaps in the two representative series of each region were filled. Table II gives an overview of all the adjustments and, as an example, Table III lists the details of the adjustment for the maximum and the minimum temperatures in Madrid. As further examples of the results, Figures 4 and 5 show the monthly temperature anomalies before and after the homogenization process for four different series.
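The gap filling of the representative series mentioned above follows Equations (3) and (4). The Python sketch below shows one way to evaluate them for a single missing value. It is only an illustration under stated assumptions: the function name and the input values are hypothetical, $S$ is taken as the sum of the squared cross-correlations of the series actually used, and the ARIMA fallback is represented simply by returning `None`.

```python
def fill_gap(neighbour_anomalies, neighbour_sds, candidate_sd, rho, q):
    """Estimate a missing anomaly of the candidate from neighbouring series.

    Returns None when the cross-correlations are too weak (S < 0.5), in
    which case the caller would fall back to an ARIMA interpolation.
    """
    # Sum of squared cross-correlations (assumed definition of S).
    s = sum(r * r for r in rho)
    if s < 0.5:
        return None

    # Equation (4): weights proportional to q_j * rho_j**2, normalized to 1.
    raw = [f * r * r for f, r in zip(q, rho)]
    total = sum(raw)

    # Equation (3): weighted mean of the neighbours' anomalies, rescaled
    # from each neighbour's standard deviation to that of the candidate.
    estimate = sum(
        (w / total) * (candidate_sd / sd) * t
        for w, sd, t in zip(raw, neighbour_sds, neighbour_anomalies)
    )

    # Damping for weak correlations: multiply by S when S < 1.
    return estimate * min(s, 1.0)


# Hypothetical inputs: two neighbours, both one standard deviation above normal.
strong = fill_gap([1.0, 2.0], [1.0, 2.0], 1.5, rho=[0.9, 0.8], q=[1.0, 1.0])
damped = fill_gap([1.0, 1.0], [1.0, 1.0], 1.0, rho=[0.6, 0.6], q=[1.0, 1.0])
rejected = fill_gap([1.0, 1.0], [1.0, 1.0], 1.0, rho=[0.5, 0.4], q=[1.0, 1.0])
```

In the first call both neighbours sit one standard deviation above their means, so the candidate is filled with one of its own standard deviations (1.5); in the second call the estimate is damped by S = 0.72; in the third, S = 0.41 and the method is discarded.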
Some effects of the homogenization are clearly visible: in La Coruña (Figure 4(A), (B)), the net warming of the minima was too large, as a consequence of an inhomogeneous break of considerable amplitude in ; in Seville (Figure 4(C), (D)), the 19th century data were rejected because of the lack of simultaneous regional data, an important break in was adjusted and the data of three different series were unified (with adjustments); in the maxima in Madrid (Figure 5(A), (B)), the large break in impedes a reasonable analysis without homogenizing; and in the minima in Madrid (Figure 5(C), (D)), the two breaks in and again are important, too. The increase in data homogeneity and quality for climate-change studies (the main goal of this study) is further investigated below in the sections "An estimation of the error margins of raw and homogenized data" and "A comparison of some results, based on raw and homogenized data".

Figure 4. Monthly anomalies of the minimum temperatures in La Coruña and the maximum temperatures in Seville: raw data (left side) and homogenized data (right side; all series with distance weighted least squares fits).

Figure 5. Monthly anomalies of maximum and minimum temperatures in Madrid: raw data (left side) and homogenized data (right side; all series with distance weighted least squares fits).

An estimation of the error margins of raw and homogenized data

The instrumental error in temperature measurement was of the order of 0.1 °C (Linacre, 1992; Servicio Meteorológico Nacional, 1956) and increased to around 0.2 °C when differences between two series are concerned (assuming linear error propagation). Any linear homogeneity adjustment based on the same data type added an error of the same amplitude. A long-term series of roughly one century required an average of two adjustments and the mean margin of this error (instrumental plus homogenization) increased to C, an amplitude of the order of the mean global warming of the 20th century (0.6 °C, IPCC, 2001). This comparison illustrates the crucial role of data quality.

On the other hand, the mean amplitude of the adjustments (around 1 °C) defined the mean error of the inhomogeneities and the uncertainty in the raw-data series, besides the instrumental error of 0.1 °C. A long series had between one and five inhomogeneous breaks, with a statistical average of around two. The errors tended to cancel each other or accumulate (partially or entirely). In the latter case, the total error (instrumental plus inhomogeneities) could exceed 1 °C, or sometimes even be higher than 2 °C. This hampered the detection of climate changes at any reasonable confidence level.

The critical role of data homogeneity was confirmed in an extensive analysis of 20th century surface-air temperature and precipitation data from the European Climate Assessment, in Wijngaard et al. (2003). The authors organized the quality of the tested series into the classes "useful", "doubtful" and "suspect".
In the period ( ), 94% (61%) of the temperature series are labelled "doubtful" or "suspect". Referring to trends and the variability of weather extremes, the authors state that "Clearly, this type of analysis is limited by the degree of inhomogeneity of the data." To compare these statements with the data of the present study, the following paragraph summarizes some comparisons between temperature changes in raw and adjusted data.

A comparison of some results, based on raw and homogenized data

To compare the thermal changes detected with raw and homogenized data, we applied a t-test

(autocorrelation-corrected) to the temperature means of the first and last 30-year intervals of the 20th century, in six examples (Table IV). After homogenization, the regional net temperature changes were substantially more similar and more consistent between the local and the mean representative series.

Table III. Details of the adjustments of the maximum and the minimum temperatures in Madrid. The iteration steps 1 and 3 are there because of the other series, although nothing was done in the Madrid series. The adjusted values are added to all data before the break. The symbol s/n is a "signal-to-noise ratio": the quotient of the adjusted value and the standard deviation of the base interval.

Maximum temperatures / Minimum temperatures

A. Adjustment of inhomogeneous breaks and rejection of inhomogeneous data
Iteration step 1
Iteration step 2 break: , adjusted with data of Burgos, anomaly October 1894 rejected. Salamanca, Segovia, Soria and Albacete, base interval: , value: 1.81 °C, s/n = break: , adjusted with data of Salamanca, Segovia, Soria, Toledo, Ciudad Real and Cuenca, base interval: , value: °C, s/n = 0.45
Iteration step 3
Iteration step 4 break: Nov March 1937, adjusted with data of Burgos, Palencia, Salamanca, Segovia, Toledo, Cuenca; interval: Jan Dec. 1951, value: 0.65 °C, s/n = anomaly August 1993 rejected. break: , adjusted with data of Burgos, Salamanca, Segovia, Soria and Albacete, base interval: , value: °C, s/n = break: , adjusted with data of Burgos, Salamanca and Albacete, base interval: , value: 0.71 °C, s/n =

B. Filling of data gaps with data from 4 11 series (of the central plains) and base intervals of approximately 20 years around the missing data.
Feb. and Dec. 1875, July 1878, July 1879, July Dec. 1875, July 1897, Oct. 1894, Sept. 1928, 1897, June 1905, June 1922, Nov. Dec. 1936, Nov. Dec. 1936, Jan Feb Apr Oct. 1937, Jan Feb April Oct. 1937, March and April Mar-Apr 1939, Aug , Aug
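The text does not state which autocorrelation correction was applied to the t-test. As a hedged illustration only, the Python sketch below uses one common choice: each 30-year sample size is replaced by an effective size based on the lag-1 autocorrelation, n_eff = n(1 − r1)/(1 + r1), and a Welch-type statistic is computed with these effective sizes. The function names, the clipping of negative autocorrelations to zero and the synthetic series are assumptions made for this example.

```python
from math import sqrt
from statistics import fmean, variance


def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a series (not defined for constant series)."""
    mean = fmean(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den


def effective_size(x):
    """Effective sample size n * (1 - r1) / (1 + r1), with r1 clipped at zero."""
    r1 = max(lag1_autocorrelation(x), 0.0)
    return len(x) * (1.0 - r1) / (1.0 + r1)


def corrected_t(first, last):
    """Welch-type t statistic for the difference of means, using effective sizes."""
    n1, n2 = effective_size(first), effective_size(last)
    standard_error = sqrt(variance(first) / n1 + variance(last) / n2)
    return (fmean(last) - fmean(first)) / standard_error


# Synthetic 30-year samples: same variability, means differing by 1.
first_period = [0.0, 1.0] * 15
last_period = [1.0, 2.0] * 15
t_value = corrected_t(first_period, last_period)

# A strongly persistent series carries far fewer independent values.
n_eff_trend = effective_size([float(i) for i in range(30)])
```

A positive lag-1 autocorrelation shrinks the effective sample size and therefore the t statistic, so a change that looks significant in an uncorrected test may lose its significance.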
In several cases, even the qualitative results and their significance levels differed: in the maximum temperatures of the Cantabrian and the Ebro valley and the minima of the Mediterranean, there was a lack of consistency between the raw local and mean series, where only one series showed a highly significant change. In all cases, the degree of consistency in the homogenized data was at least similar, and usually higher. These results confirmed the substantially larger errors of the raw series and suggested that an analysis based on the raw data may in many cases not be valid if a reasonable confidence level is required. Furthermore, according to the section "An estimation of the error margins of raw and homogenized data", the homogenization procedure improves the data quality by reducing the error margins (eliminating the error of the order of 1 °C due to the inhomogeneities) and is strongly recommended as a preliminary step before analysing the data.

The empirical urban correction and the approach with the data of Madrid and Toledo

To test the performance of both urban corrections for Madrid, the differences between the average series of central Spain (without Madrid) and Madrid were compared (Figure 6). The average series stemmed from medium-sized towns with an average population of around , for which the urban effect was negligible, compared to Madrid. The decreasing trend of the differences without any urban correction was, at least partly, owing to the urban effect (the climatic differences in central Spain were not large). With the empirical correction, this trend was reversed, signifying over-adjustment of the urban effect. The series C, with the alternative adjustment (Madrid–Toledo), still showed an increase, but clearly weaker and less significant than B, indicating a more realistic correction, although slightly too great. The urban effect in Madrid was smaller than it would be theoretically, following the population data and the comparison with Toledo.
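The different behaviour of the two corrections can be made concrete with a short Python sketch. It compares the increments per additional million inhabitants under the power law of Equation (5) and under the linear relation of Equation (6). This is an illustrative sketch, not the computation used in the study: the function names are hypothetical, the intercept p is set to zero as a placeholder, and only the relative sizes of the power-law increments are evaluated, because the coefficient a and the reduction factor cancel in these ratios.

```python
def power_law_effect(population, a, exponent=0.45, reduction=1.0):
    """Equation (5), optionally scaled by a regional reduction factor."""
    return reduction * a * population ** exponent


def linear_effect(population_millions, q=0.35, p=0.0):
    """Equation (6); q is about 0.35 °C per million, p = 0 is a placeholder."""
    return p + q * population_millions


def increments_per_million(effect, millions):
    """Additional urban effect contributed by each successive million inhabitants."""
    return [effect(k) - effect(k - 1) for k in range(1, millions + 1)]


# Power law: population in inhabitants; a = 1 because only ratios are used.
power_steps = increments_per_million(
    lambda k: power_law_effect(k * 1.0e6, a=1.0), 3
)
ratio_second = power_steps[1] / power_steps[0]  # equals 2**0.45 - 1
ratio_third = power_steps[2] / power_steps[0]   # equals 3**0.45 - 2**0.45

# Linear relation: population in millions, identical step for every million.
linear_steps = increments_per_million(linear_effect, 3)
```

The ratios come out at roughly 0.37 and 0.27, in line with the increments quoted earlier for Madrid (0.46 °C and 0.35 °C relative to 1.27 °C for the first million). The power law therefore concentrates most of the correction in the first million inhabitants, whereas the linear relation adds the same amount for every million.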
The Madrid data were compiled from the Retiro park observatory, located in the urban centre, but close to the edge of this green area of about 1.2 km². The minima at dawn were very probably lowered, thus attenuating the urban effect. García Hernández et al. (1997) stated a clear influence of