Individual Equity Return Data From Thomson Datastream: Handle with Care!
December 2003
PRELIMINARY (Please do not quote without permission)
* PO Box 117168, 321 Stuzin Hall, Gainesville, FL 32611-7168. Email: ozgur.ince@cba.ufl.edu and burt.porter@cba.ufl.edu. Phone: (352) 392-8928. We would like to thank Ralf Elsas... for their many helpful comments. Any remaining errors are ours alone.
Abstract
We compare individual equity return data from Thomson Datastream (TDS) for one large national equity market, the United States, to the source most often used by academics, the Center for Research in Security Prices (CRSP), for the period 1975-2002 in order to evaluate the suitability of TDS for use in studies involving large numbers of individual equities in markets outside the U.S. We discover important issues of coverage, classification, and data integrity and find that naive use of TDS data can have a large impact on economic inferences, particularly early in the sample period and among smaller stocks. We show that after careful screening of the TDS data, although differences remain, inferences drawn from TDS data are similar to those drawn from CRSP.
I. Introduction
International asset pricing occupies a prominent position in the finance literature. From a U.S. perspective, non-U.S. equity markets provide an opportunity to verify results from tests using U.S. data. The study of all markets is also interesting in its own right. Studies of market integration, market comovement, the benefits from international diversification, etc., add to our understanding of finance in an important way. A necessary condition for conducting such research is the availability of high quality equity return data. There exist many sources for non-U.S. equity return data, including that maintained by the Pacific-Basin Research Center (PACAP) for eight Asian markets beginning in 1975, as well as the individual markets themselves. Alternatively, many researchers have used Thomson Datastream (TDS) for its broad and deep coverage. We know of no current alternative to TDS in terms of number of markets covered and stocks covered in each market.
We evaluate the use of Thomson Datastream data for academic research by comparing TDS data for U.S. equities to the "standard" academic source, the Center for Research in Security Prices (CRSP). The CRSP data is maintained specifically for research on U.S. equity markets and so is an appropriate standard. We are not evaluating TDS vs. CRSP per se; rather, we use the comparison between the two databases to identify issues that may be relevant in the use of TDS data for non-U.S. equities. In all of what follows we never use CRSP to make corrections to TDS; rather, we screen the TDS data independently and then compare the results to CRSP to see how well our proposed screens perform. Since users of international TDS equity data rarely have an independent source available, the procedures we develop must not require an independent data source in order to be of practical use.
To our knowledge, this is the first formal examination of the TDS equity return data as a research database even though several papers make use of worldwide equity return data from this source. Examples include Griffin, Ji, and Martin (2003) and Naranjo and Porter (2003) who examine the interaction between country neutral momentum strategies, Griffin (2002) who examines whether country-specific or global versions of Fama and
French's three-factor model better explain time-series variation in international stock returns, and Porter (2003) who investigates the interaction between market-wide liquidity shocks in national equity markets. Many authors use Thomson Datastream to compile samples of all stocks traded within a national market. Examples include Clare and Priestley (1998) for Malaysian stocks, Brooks, Faff, and Fry (2001) for Australia, Pinfold, Wilson, and Li (2001) for New Zealand, Hiller and Marshall (2002) in the U.K., Lau, Lee, and McInish (2002) for Singapore and Malaysia, and Elsas (2003) for Germany.
We focus on issues of coverage, classification, and data integrity. We begin by downloading price, shares outstanding, and total return data for all equities traded in the U.S. and included by TDS in their research lists and lists of equities that are no longer traded (dead) for the period 1975-2002. We compare this data to the CRSP universe during the same time period. Our investigation reveals several problems with using TDS data for research involving broad market coverage. Most troubling is the inability to easily distinguish between the various types of securities traded on equity exchanges. We also find that classification variables often reflect only the most current values. For example, a security that begins trading on the Nasdaq NMS and later delists and begins trading on the non-Nasdaq OTC market would be classified as a non-Nasdaq OTC security by TDS throughout the sample period. We also identify several issues with calculating total returns using return variables provided by TDS.
Most of the problems identified in this paper are concentrated among the smaller size deciles calculated using NYSE breakpoints. We illustrate the effects of these problems on inferences by reporting sample statistics on size decile portfolios and by reporting the profits from simple momentum strategies.
It is well known that portfolios short recent losers and long recent winners will be concentrated in smaller stocks since small stocks tend to have higher variance; therefore data problems with calculating returns of small stocks will likely show up in momentum portfolio returns. We find that the well documented momentum effect in returns is not detectable in the raw TDS data.
We screen the TDS data in two steps. First we attempt to identify the non-common equity securities included in our TDS sample. Second we run a series of screens to identify 'unusual' return patterns and either replace the returns in question using information contained in other TDS variables or drop the observations from our sample. Although we develop our rules for screening observations using only information from TDS, we verify using CRSP that our screens do not drop valid observations.
We give an overview of the Thomson Datastream data in Section II, and document our extraction methods. Section III compares the coverage of TDS and CRSP in the U.S. Section IV identifies idiosyncratic problems with using TDS return data, Section V compares dividend data from CRSP and TDS, and Section VI summarizes our findings and concludes.
II. Datastream Overview
Thomson Datastream (TDS) has price, volume, market capitalization and dividend data for approximately 50,000 equities covering 64 developed and emerging markets with up to 25 years of data. There is also considerable accounting, fixed income, index, commodity, macroeconomic time series, interest rate, and exchange rate data available, although none of this is discussed in this paper. To download security data we make use of constituent lists. TDS constituent lists are maintained by TDS and contain all firms in an industry, sector or market. Each list contains the TDS identification numbers of all firms that are part of the list. We use lists FAMERA through FAMERZ (one list for each letter of the alphabet) for equities currently trading in the U.S. and DEADUS1 through DEADUS6 for equities that are no longer traded. We download daily data for all days between 1/1/1975 and 12/31/2002 and create monthly returns from end-of-month daily data 1. Table 1 lists the TDS variables we use and their definitions. For comparison we use the entire CRSP universe for the same time period including delisting returns and partial period data.
1 This yields exactly the same return series as requesting monthly frequency data. We request the more detailed data to help us in developing rules to screen the data.
Extracting a large volume of data from TDS can take many days due to limitations on how much data can be extracted in a day. The length of time required along with the constant updating nature of the data can cause some difficulties. For example, we download the current data first followed by the dead equities, otherwise a firm that ceases trading while the data is being extracted will be lost. The approach used by TDS and CRSP when a user requests data after a firm ceases trading is different. CRSP will report no data whereas TDS reports the last valid data point. TDS pads the time period after the firm ceases trading with constant values equal to the last month (or day) that the firm traded. To identify and eliminate these dummy records we delete all monthly observations from TDS from the end of the sample to the first non-zero return. We realize that a small number of valid zero return observations may be lost at the end of the sample 2. Table 2 provides summary statistics on the data from the two sources with nonmissing return data. We have 22,832 unique permnos (CRSP Permanent Issue Number) and 2,256,605 monthly observations from CRSP and 21,245 unique TDS identifiers and 2,048,255 observations from Thomson Datastream. Of the CRSP observations, 1,941,744 or 86% are share code 10 and 11 defined as common equity of U.S. based companies. Most market studies using CRSP data restrict themselves to these share codes. Of the TDS observations, 2,002,459 or 98% have TYPE equal to EQ (Equity). Within common stock, CRSP has 503,107 monthly NYSE observations compared to TDS with 946,940, or almost twice as many as CRSP. As we will show, most of this discrepancy is due to the inclusion of non-common equity securities by TDS that are traded on the NYSE. Somewhat surprisingly, there are fewer TDS observations associated with AMEX (124,521) and Nasdaq (472,398) than CRSP observations on the same exchanges (230,497 and 1,208,137 respectively). 
2 TDS lists a variable "TIME" defined as "date of last equity price data"; however, a random check of several securities shows this variable to be uninformative for U.S. equities. In many cases the variable value is #N/A (for example: Integrated Silicon Systems) or the value does not coincide with actual return data available on both TDS and CRSP (see EMS Systems, whose value for the TDS variable TIME is 12/29/1989 but which has valid CRSP and TDS data through May (CRSP) and April (TDS) of 1990).
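The removal of padded records described in the text above can be sketched as follows. This is a minimal illustration only, not the authors' code: the function name `strip_trailing_padding` is hypothetical, returns are assumed to be decimal fractions in chronological order for a single security, and treating missing values (`None`) at the end of the series as padding is an added assumption.

```python
from typing import List, Optional


def strip_trailing_padding(returns: List[Optional[float]]) -> List[Optional[float]]:
    """Drop observations from the end of a monthly return series back to
    the last non-zero return (hypothetical helper, one security at a time).

    Assumption: trailing None values are treated like padded zeros.
    """
    end = len(returns)
    # Walk backwards while the return is zero (padding) or missing.
    while end > 0 and (returns[end - 1] is None or returns[end - 1] == 0.0):
        end -= 1
    return returns[:end]


# A zero return inside the series is kept; only the trailing run is removed.
example = [0.02, 0.0, -0.01, 0.0, 0.0]
trimmed = strip_trailing_padding(example)
```

As the text notes, a genuine zero return in a firm's final months would also be dropped by a rule of this kind.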
We show the potential impact on inferences by calculating equal-weighted market returns, equal-weighted returns by exchange, size decile returns, and the returns to two momentum trading strategies. Our CRSP dataset for this exercise contains all equities with share code equal to 10 or 11 (common equity) that are traded on the NYSE, AMEX, or Nasdaq exchanges. The TDS dataset contains all securities that are of type 'EQ' (equity) and have an exchange identifier of NYSE, AMEX, Nasdaq-NMS, or Nasdaq non-NMS. No other data screens or checks have been used. Table 3 presents the results.
The TDS equal-weighted average market return of 2.40% per month is 72% higher than the comparable CRSP average return of 1.41% per month. The time series correlation of the equal-weighted market return series is 0.66. The value-weighted market returns are more similar, with nearly identical mean returns and a time series correlation of 0.998, implying that the differences between the two datasets are concentrated among smaller issues. Comparing equal-weighted returns by market we see that the biggest difference is among AMEX firms, although as we will see later this is due in large part to errors in the return data. Mean returns calculated from TDS are also much higher than those calculated from CRSP for both NYSE and Nasdaq firms. The NYSE return series have a correlation of 0.84 and the Nasdaq series have a correlation of 0.93. Comparing size decile returns we see the largest differences in the smaller deciles.
The momentum trading strategy results are consistent with the large disparity in the smaller decile returns between the two data sources. Using CRSP data, a strategy that is long the top 10% of firms ranked on average return over months t-2 through t-12, short the bottom 10%, and held for one month before rebalancing, referred to in the table as a 1090 strategy, earns an average monthly return of 1.13% with an associated t-statistic of 2.86.
A comparable strategy using TDS data results in an average of 0.26% per month and we cannot reject the null that the average return is zero. The results from a 3070 strategy differ even more: the return calculated from CRSP data averages 0.95% per month with an associated t-statistic of 3.65, while the average return calculated from TDS is negative.
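The ranking-and-holding rule behind these momentum results can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `momentum_return` and the data layout (one list of monthly returns per firm, all aligned to the same calendar) are hypothetical, the winner and loser portfolios are assumed to be equal-weighted, and firms with any missing return in the ranking window or the holding month are simply skipped.

```python
from statistics import mean
from typing import Dict, List, Optional


def momentum_return(
    returns: Dict[str, List[Optional[float]]], t: int, tail: float = 0.10
) -> Optional[float]:
    """Winner-minus-loser return in month t.

    Firms are ranked on their average return over months t-12..t-2
    (skipping month t-1), then the top `tail` fraction is held long and
    the bottom `tail` fraction short for month t. tail=0.10 corresponds
    to the 10%/90% cut-offs described in the text.
    """
    scores: Dict[str, float] = {}
    for firm, r in returns.items():
        if t < 12 or t >= len(r):
            continue  # not enough history, or no holding-month observation
        window = r[t - 12 : t - 1]  # months t-12 through t-2 (11 months)
        if r[t] is None or any(x is None for x in window):
            continue  # assumption: require a complete ranking window
        scores[firm] = mean(window)

    ranked = sorted(scores, key=scores.get)  # ascending: losers first
    k = int(len(ranked) * tail)
    if k == 0:
        return None
    losers, winners = ranked[:k], ranked[-k:]
    long_leg = mean(returns[f][t] for f in winners)
    short_leg = mean(returns[f][t] for f in losers)
    return long_leg - short_leg


# Toy data: ten hypothetical firms with constant past returns.
firms = {f"f{i}": [i * 0.01] * 12 + [i * 0.001] for i in range(10)}
spread = momentum_return(firms, t=12)
```

Calling the same function with `tail=0.30` gives the wider 30%/70% variant. Because the extreme portfolios tend to hold small, volatile stocks, a handful of erroneous returns in the input can dominate either leg.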
It is clear that there are important differences between the two data sources and that these differences are concentrated in the smaller size deciles. In the next section we explore differences in coverage between the two data sources and discuss a method of screening the TDS database for securities that researchers may wish to exclude.
III. Coverage
To isolate the differences in coverage between the two data sources we match the databases security by security using the last firm observation in each year between 1975 and 2002. We link securities using combinations of CUSIP, ticker symbol, and name. We manually verify a sample of matching firms and nonmatching firms to confirm the quality of our matching process. Table 4 summarizes the results of our matching exercise. We are able to match 60% of December CRSP observations with share code 10 and 11 to December TDS observations. The rate at which we match CRSP NYSE common equity (69%) is slightly higher than for either AMEX (63%) or Nasdaq (57%). The matching is much better later in the sample period than in earlier years. Figure 1 summarizes the fraction of CRSP permnos that are also found in TDS in December of each year. Approximately 20% of the CRSP sample is also in TDS in December of 1975 and this fraction rises steadily throughout the sample, reaching almost 90% by December of 2002. Of the December 2002 CRSP observations that we are unable to match to TDS, approximately half are ADRs (share codes 30 through 39, for which TDS maintains separate constituent lists) and the remainder are firms that are either absent from the TDS constituent lists or exist on TDS with different CUSIP numbers than on CRSP.
We are surprised that not all firms that cease trading are included on the TDS constituent lists of inactive firms, DEADUS1 through DEADUS6, and therefore do not appear in our sample.
Using the TDS interactive utility, Advance Version 4.0, we are able to locate several large firms that have ceased trading and are not included on the
dead constituent lists. Examples include such well-known names as Atlantic Richfield Co., GTE Corp, and Honeywell.
Figure 2 summarizes the fraction of TDS identifiers with TYPE equal to 'EQ' that are also found in CRSP in December of each year. Approximately 70-80% of the TDS sample is also on CRSP until the mid 1990s, when the fraction steadily falls until only 55% of the TDS sample is also on CRSP in December of 2002. The large number of TDS identifiers with no corresponding CRSP permnos, especially late in the sample period, is due in large part to the fact that TDS includes many securities with a type indicator of "equity" that are not common stock of U.S. firms. Such securities include stock of firms incorporated outside the U.S., closed end funds, REITs, ADRs (although there are very few ADRs on the TDS equity lists since there are specific TDS constituent lists for ADRs), Shares of Beneficial Interest, and traded partnership units.
Researchers using CRSP data commonly restrict the sample to share codes 10 and 11; however, there is no simple method for performing the same screen with TDS. Since the only other source of information about the security is the variable NAME, we search the NAME variable for key words or phrases that may indicate the security is not common equity. Our procedure is to search the name field for key phrases, create a candidate list of firms for removal by extracting all observations containing those phrases, and then review the list of observations for any firms which should not be removed from the sample. For example, we search for the letter combinations 'pf' and 'pref' to identify preferred stock, but explicitly prevent removing 'Pfizer'. We use the TDS variable GEOG to remove any firm incorporated outside the U.S. and the EXMNEM variable to exclude any firm not traded on the NYSE, AMEX, or Nasdaq. Our screening process reduces the number of TDS observations from 2,002,459 to 1,267,218, a reduction in sample size of 37%.
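The first pass of the name screen described above might look like the sketch below. It is illustrative only: the function name `flag_non_common`, the security names in the example, and the case-insensitive substring matching are all assumptions. Only the key phrases 'pf' and 'pref' and the 'Pfizer' exception come from the text; the full phrase list used in the paper is not reproduced here.

```python
from typing import Iterable, List, Sequence

# Key phrases and exception taken from the example in the text.
KEY_PHRASES = ("pf", "pref")
PROTECTED = ("pfizer",)


def flag_non_common(
    names: Iterable[str],
    key_phrases: Sequence[str] = KEY_PHRASES,
    protected: Sequence[str] = PROTECTED,
) -> List[str]:
    """Return names that contain a key phrase and are not protected.

    The output is a candidate list for manual review, not a final
    exclusion list.
    """
    candidates = []
    for name in names:
        lowered = name.lower()
        if any(p in lowered for p in protected):
            continue  # explicitly kept, e.g. 'Pfizer'
        if any(k in lowered for k in key_phrases):
            candidates.append(name)
    return candidates


# Hypothetical NAME values, for illustration only.
sample_names = ["PFIZER", "ACME PF", "BETA PREF SHS", "GAMMA CORP"]
candidates = flag_non_common(sample_names)
```

Plain substring matching is deliberately broad, which is why the manual review step in the text matters: a short phrase such as 'pf' will also match unrelated company names.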
We repeat our calculation of market portfolio returns and momentum portfolio returns using the TDS screened sample and compare the results to our CRSP sample. The third set of columns of Table 3 reports our results. The results are similar to the unscreened sample implying that the large differences in market returns,
size decile returns, and momentum returns are not due solely to the inclusion of securities other than common equity by TDS.
IV. TDS Data Issues
Our goal is to develop methods for identifying data errors in TDS that can be used in markets outside the U.S. for which an alternative data source is not readily available to the researcher. In developing these rules we make extensive comparisons of CRSP and TDS matched data, but we take great care that no screen or correction we develop would require the use of such an outside source.
Several TDS data errors we identify would be difficult, if not impossible, to identify without an alternative data source. For example, in June of 1992 Big O Tires, Inc (permno=92508) conducted a 1:5 reverse stock split that is reflected in the shares outstanding and closing price from CRSP. The unadjusted price series in TDS matches that in CRSP, including the large rise in price level in 6/1992; however, the change in shares outstanding and adjusted price is in 6/1990, resulting in an incorrect return index and return in June of 1990 and 1992 and an incorrect shares and market value for the full two year period.
To be fair, TDS often does a better job than CRSP in reflecting capital structure changes. For example, TDS will often reflect a seasoned equity offering on or very near the day of the offering, whereas CRSP will not reflect the additional shares or the change in market capitalization until the end of the quarter or fiscal year 3. For example, Nashville Country Club, Inc (now known as TBA Entertainment, CRSP permno=80256) offered shares in a seasoned offering in April of 1996 but the additional shares are not reflected in CRSP until 12/27/1996. The TDS data reflects the additional shares in May of 1996. Since market value is derived from shares outstanding, the CRSP market capitalization for this firm is incorrect for the eight month interval.
There are other differences in which it is not clear which data source is 'correct'.
3 Thank you to Jay Ritter for providing an alternative source of SEO offer dates and share quantities.
The closing prices used by each source often do not agree. For example, according to CRSP,
Apogee Technology Inc. closed in May of 1990 at $4.625 and in June at $9.75 for a return of 110.81%. The same firm is listed in TDS closing in May at $4.00 and in June at $9.50, for a return of 137.50%, a difference of 26.69 percentage points. Note also that CRSP maintains prices in increments as small as 1/64 while TDS rounds all prices to the nearest penny, resulting in differences in return, particularly for low priced stocks. Both CRSP and TDS report closing price as a bid/ask average on days in which the stock does not trade.
To check for errors in returns calculated from changes in the total return index, we calculate returns using price and dividend data and compare them to the percentage change in the return index. We only compare the two returns in months in which the ratio of adjusted price to unadjusted price is the same as in the previous month, in order to prevent differences in the two return calculations from being due to a capital structure change. The TDS practice of rounding prices to the nearest penny can cause nontrivial differences in the calculated return when prices are small, so we drop all observations in both TDS and CRSP when the end of previous month price is less than $1.00.
A related problem is the discreteness of the TDS total return index. The return index is reported to the nearest tenth, so when the return index is very small, discreteness becomes important. To see why this is true, consider Firepond Inc. in October and November of 2001. According to TDS, Firepond closed at $4.70 in September, $7.89 in October and $8.00 in November. The corresponding values of the total return index are 0.5, 0.8, and 0.8. No dividends or capital changes occurred in this period. The returns calculated from price changes are 67.87% and 1.39%, whereas the returns calculated from the return index changes are 60.00% and 0.00%. In these cases we substitute returns calculated directly from prices for returns calculated from the return index.
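The Firepond figures above can be reproduced with a few lines of arithmetic. The sketch below uses only the prices and return index values quoted in the text; the helper name `simple_return` is hypothetical, and the dividend argument is included only to show where a cash dividend would enter (it is zero here, as stated in the text).

```python
from typing import List


def simple_return(previous: float, current: float, dividend: float = 0.0) -> float:
    """One-period return from two price (or index) levels plus any dividend."""
    return (current + dividend) / previous - 1.0


# Month-end values for Firepond Inc., September to November 2001 (from the text).
prices: List[float] = [4.70, 7.89, 8.00]
return_index: List[float] = [0.5, 0.8, 0.8]

# Returns implied by price changes versus returns implied by the rounded index.
price_returns = [simple_return(a, b) for a, b in zip(prices, prices[1:])]
index_returns = [simple_return(a, b) for a, b in zip(return_index, return_index[1:])]

for month, rp, ri in zip(["October", "November"], price_returns, index_returns):
    print(f"{month}: price-based {rp:.2%}, index-based {ri:.2%}")
```

The gap between the two series comes entirely from reporting the index to one decimal place, so it shrinks as the index level grows.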
Suspension of trading is handled differently by the two sources. CRSP reports missing values for prices and daily returns. However, while monthly returns are reported as missing if trading is suspended at the end of the month, the return for the first month after trading resumes is calculated using the last available end of month price, even if the intervening time interval is long, and without accounting for the multiperiod nature of the return. For example, Ormand Industries (permno=34905) stopped trading on 5/31/1990
and resumed trading on 9/19/1990. The September return is calculated from the end of month price, 0.68750, and the last valid end of month price, 0.43750 on 4/30/1990, resulting in a simple 1-month reported return for September of 57.14%. TDS reports sporadic trades during this period with changing prices. The way in which CRSP calculates returns after the resumption of trading and the difficulty of identifying trading halts on TDS can cause large differences in monthly returns between the two sources. Since we are unable to identify trading suspensions using only TDS data, we make no corrections for this problem.
We identify many instances of data errors. According to TDS, in the first eight months of 1995, Magellan Petroleum Corp never has a daily closing price above $2.38 except that the closing prices for 7/31, 8/1, and 8/2 are all above $13.60. On 8/3 the price reverts to $1.88. The closing prices on the three days in question on CRSP are 1.9375, 1.8750, and 1.9375. The resulting monthly TDS return for July is 626.69% vs. a CRSP reported return of 0.00%. We screen for such occurrences by setting any return above 300% that is reversed within one month to missing.
After screening the TDS equity data for non-common equity securities and searching for data errors as described above, we recalculate the portfolio returns for the same portfolios reported in Table 3. The results are reported in Table 5. We report revised CRSP results as well because we have dropped CRSP observations with previous month price less than $1.00. In calculating momentum returns we only enforce the price restriction during the portfolio formation period and not during the holding period. The TDS portfolio returns are now much closer to those calculated from CRSP. The average CRSP equal-weighted market return is 1.29% per month compared to the TDS equal-weighted market return of 1.51%.
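A screen of the kind described for the Magellan Petroleum case, which sets any return above 300% that is reversed within one month to missing, could be sketched as follows. The text does not say how a "reversal" is measured, so the rule used here is an assumption made purely for illustration: a spike counts as reversed when the compounded return over the spike month and the following month is below a hypothetical cut-off of 50%. The function name `screen_reversed_spikes` is likewise hypothetical.

```python
from typing import List, Optional


def screen_reversed_spikes(
    returns: List[Optional[float]],
    spike: float = 3.0,          # 300%, from the text
    max_two_month: float = 0.5,  # ASSUMPTION: reversal cut-off, not from the text
) -> List[Optional[float]]:
    """Set to None any monthly return above `spike` that is undone next month.

    Returns are decimal fractions in chronological order for one security.
    """
    cleaned = list(returns)
    for t in range(len(returns) - 1):
        r, nxt = returns[t], returns[t + 1]
        if r is None or nxt is None:
            continue
        two_month = (1.0 + r) * (1.0 + nxt) - 1.0
        if r > spike and two_month < max_two_month:
            cleaned[t] = None  # the spike return is treated as a data error
    return cleaned


# Stylised version of the July 1995 spike: +626.69% followed by a collapse.
suspect = [0.01, 6.2669, -0.86, 0.02]
screened = screen_reversed_spikes(suspect)
```

Under this sketch only the spike month is blanked; the following month's large negative return is just as suspect, and a user might reasonably choose to set it to missing as well.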
The correlation between the two equal-weighted market indexes is 0.995 and the correlation between value-weighted indices is 0.998. The individual market return means and standard deviations are also similar and the correlations are high. The momentum returns that for TDS were insignificant and sometimes negative are now positive, significant, and highly correlated with the
momentum returns calculated from CRSP 4. In unreported results, we delete all observations not common to both datasets and calculate all of the portfolio returns. Although differences remain, they are generally quite small.
There are several reasons why we should not expect the CRSP and TDS results reported in Table 5 to be identical. First is the issue of coverage. Not only will this affect the average market returns but also the NYSE size breakpoints. In addition, the issue of classification errors will induce a survivorship bias in a TDS sample of NYSE/AMEX/Nasdaq firms. Since firms with poor returns are more likely to be delisted and TDS captures only the most recently available exchange information, firms that delist from the major exchanges and trade over-the-counter will be excluded from the TDS sample, raising the average return of the firms that remain.
We illustrate the survivorship issue by calculating life expectancy for every year in each sample. In January of each year, for all firms with valid observations in that month, we estimate the life expectancy of a firm by averaging the number of months that each firm remains in the sample. The 'life' of a firm has a maximum value equal to the number of months remaining before December of 2002. Panel A of Table 6 reports the results. In every year the average number of months remaining is larger for TDS than for CRSP, implying that firms that delist are less likely to be included in the TDS sample. A nonparametric Wilcoxon rank-sum test for a difference in means easily rejects the null of no difference in every year.
In addition, the issue of classification makes it difficult to identify NYSE firms from which the breakpoints are calculated, particularly early in the sample period. Table 7 lists the breakpoints at the end of 1975 and 2001, calculated for stocks classified as trading on NYSE for each of TDS and CRSP.
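The life expectancy measure described above can be sketched as a short calculation. This is an illustration under assumptions, not the authors' procedure: the function names and the input layout (first and last observation month per firm) are hypothetical, firms are assumed to be observed continuously between those two months, and "months remaining" is counted from January of the given year to the firm's last month, capped at December 2002.

```python
from typing import Dict, Optional, Tuple

YearMonth = Tuple[int, int]  # (year, month), month in 1..12


def month_index(year: int, month: int) -> int:
    """Map a calendar month to a running integer so months can be subtracted."""
    return year * 12 + (month - 1)


def life_expectancy(
    spans: Dict[str, Tuple[YearMonth, YearMonth]],
    year: int,
    sample_end: YearMonth = (2002, 12),
) -> Optional[float]:
    """Average months remaining for firms with an observation in January of `year`."""
    start = month_index(year, 1)
    cap = month_index(*sample_end)
    lives = [
        min(month_index(*last), cap) - start
        for first, last in spans.values()
        if month_index(*first) <= start <= month_index(*last)
    ]
    return sum(lives) / len(lives) if lives else None


# Hypothetical firms: (first observation, last observation).
spans = {
    "A": ((1990, 1), (2002, 12)),
    "B": ((1995, 6), (2000, 12)),
    "C": ((1999, 1), (2001, 1)),
}
expected_life_2000 = life_expectancy(spans, 2000)
```

A sample that silently omits delisted firms will show longer average lives under this measure, which is the pattern the text reports for TDS relative to CRSP.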
4 The CRSP 1090 momentum return of 1.97% per month is very high by the standards of the literature; however, this value is not due only to dropping firms with prices less than $1.00 during the portfolio formation period. Restricting the sample to observations that exist on both CRSP and TDS lowers the CRSP 1090 momentum return to 1.38% per month.
The first set of columns lists breakpoints and the number of firm/month observations falling in each decile using CRSP, the second set of columns lists breakpoints and observations calculated from the 'raw' TDS
data and the third set of columns refers to the TDS data after screening for non U.S. and non common equity securities. In December of 2001, the CRSP and screened TDS size decile breakpoints and equity counts are very similar. The difference in breakpoints between the raw and screened TDS samples shows that most of the screened securities have very small market capitalization. This is also reflected in the average market capitalization figures.
The size breakpoints are very different between the samples in December of 1975. Interestingly, the number of firms from which the breakpoints are calculated is higher for the screened and corrected TDS sample (2044) than for the CRSP sample (1429). The smaller average NYSE market capitalization figure combined with the larger number of observations and the smaller breakpoints implies that the additional firms are quite small. We believe this is due to stale exchange information. For CRSP, the ratio of the number of observations in decile 1 to the number of observations in decile 2 for 1976 is over 5:1 because the average Nasdaq/AMEX firm is much smaller than the average NYSE firm. The comparable ratio for TDS for 1976 is only 1.1:1. Taken together, these facts suggest that the TDS size breakpoints have not been calculated only from stocks that traded on the NYSE at the end of 1975. By the last year of the sample period the breakpoints and distribution of firms by decile are much more similar.
V. Dividends
We also compare the dividend information provided by CRSP and TDS. We compare CRSP dividends coded as ordinary or liquidating cash dividends to all TDS dividends. We use the TDS dividend adjusted for capital changes and recover the original dividend amount by multiplying the adjusted dividend by the ratio of unadjusted price to adjusted price.
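The recovery of the original dividend from the capital-change-adjusted dividend is a one-line rescaling, sketched below. The function name `unadjusted_dividend` and the numbers in the example are hypothetical. Returning `None` when a price is unavailable is an assumption about how to represent observations for which the unadjusted dividend cannot be computed.

```python
from typing import Optional


def unadjusted_dividend(
    adjusted_dividend: float,
    unadjusted_price: Optional[float],
    adjusted_price: Optional[float],
) -> Optional[float]:
    """Original per-share dividend = adjusted dividend * (unadjusted / adjusted price)."""
    if unadjusted_price is None or adjusted_price is None or adjusted_price == 0:
        return None  # cannot undo the capital-change adjustment
    return adjusted_dividend * (unadjusted_price / adjusted_price)


# Hypothetical case: a later 2-for-1 split halves both the adjusted price
# and the adjusted dividend for dates before the split.
original = unadjusted_dividend(0.25, unadjusted_price=20.0, adjusted_price=10.0)
missing = unadjusted_dividend(0.25, unadjusted_price=None, adjusted_price=10.0)
```

The rescaling assumes the same adjustment factor applies to prices and dividends on a given date.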
First we examine the common set of observations and find that of 136,353 firm months in which either CRSP or TDS show a dividend payment, 127,236 or 93.31% of the firm months show identical dividend amounts from each source. 8,215 dividend observations or 6.03% disagree on the dividend payment amount, and 902 observations or 0.66% have non zero values for TDS dividends but unadjusted prices are missing so
dividends before any capital changes cannot be calculated. Of the 8,215 observations that disagree as to the dividend amount, 68% have positive dividend payments according to CRSP and zero according to TDS. 13% have zero dividends according to CRSP and positive according to TDS, with the remainder showing positive dividends on each source but disagreeing on the dividend amount. Many of the observations that show positive dividends on CRSP and zero dividends on TDS are for firms paying regular dividends. For example, the NYSE listed firm American Can Co., later renamed Primerica Corporation (CRSP permno=10241), paid a quarterly dividend every quarter from 1Q75 through 3Q88 in amounts from $0.40 to $0.725 per share; however, the first dividend reflected in TDS is in January of 1987.
We calculate market dividend yields as the sum of all dividends paid during the previous year, calculated as per share dividend times shares outstanding computed from market value and price, divided by the sum of all firms' market values. Figure 3 plots the monthly dividend yields for the combined NYSE/AMEX/Nasdaq sample. Although the time series of the two dividend yields is similar throughout the sample period, the fit is better in the latter half. The common sample dividend yields have a correlation of 0.996.
We recalculate market dividend yields without restricting the sample to matched observations. Figure 4 plots the results. The CRSP dividend yield is higher than the TDS dividend yield in the first half of the sample although they do move together. In the second half there appears to be little difference in the two measures. The correlation of the two measures of the market dividend yield is 0.982.
VI. Conclusion
Thomson Datastream is a rich data source containing equity return data for approximately 50,000 equities in 64 developed and emerging markets with up to 25 years of data; however, issues of classification, coverage, and data integrity require that care be used.
We compare Thomson Datastream (TDS) data for U.S. equities to data from the
Center for Research in Security Prices (CRSP) in order to identify features of the TDS data that might cause errors in inference for the unwary researcher. We find that TDS includes data for many securities with type equal to 'EQ' (equity) that the researcher may wish to exclude from her sample. Examples of such securities include preferred stock, traded warrants, REITs, closed-end funds, exchange traded funds, and shares of beneficial interest; however, to the best of our knowledge there is no simple method for classifying these securities. By scanning the security name field for clues as to the security type, we are able to identify over 35% of the monthly observations as not being common equity.
We also find several errors related to the country constituent lists maintained by TDS. We identify several examples of large firms for which TDS maintains data but that are not included on the appropriate constituent list and hence will not be downloaded by the researcher. Since we can only check for missing firms manually, by identifying firms that exist on CRSP and are not in the data we download from TDS, we are not sure how common this problem is; however, we do not have trouble finding several large firms that are not on the TDS lists of non-traded (dead) firms. We also have no way of knowing how common this problem is in other markets.
We also find that the exchange information provided by TDS usually applies only to the exchange on which the security is trading when data is downloaded or, for securities that are no longer traded, the last available exchange. This causes several problems. First, if the researcher wishes to include only securities traded on the major exchange(s) of a particular country, then the sample may include a survivorship bias. Since poorly performing firms are those most likely to delist and trade over-the-counter, the remaining firms are likely to have higher average returns. Second, for countries such as the U.S.
with multiple major exchanges, methods such as the using of NYSE determined size breakpoints can be problematic, particularly the further back in time you go. We identify many instances of errors in the return data. We compare returns calculated from changes in the TDS total return index to returns calculated from price and 15
dividend data and either drop observations in which there is a large discrepancy or substitute the return we calculate for the return calculated from the change in the return index. After screening the data for non-common equity and obvious errors in the data, we find that market-wide, exchange, and decile portfolio returns are quite similar between TDS and CRSP. We also find positive profits to momentum trading strategies using both the CRSP data and the screened and corrected TDS data that are statistically significant and highly correlated. However, the means are quite different but this is not surprising considering the large discrepancies in coverage, particularly early in the sample period. In our final judgment, TDS provides an excellent source of equity return data, however the researcher must take great care to screen and correct the data. We argue that failure to do so can result in very misleading inferences being drawn from tests using these data. 16
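The return cross-check summarized above can be illustrated with a minimal Python sketch. It is not the procedure used to produce the reported results. The function names and the 0.05 tolerance are hypothetical, and the sketch assumes the dividend is a per-share cash amount, attributed to the month by ex-date, on the same adjustment basis as the prices.

```python
def ri_return(ri_prev, ri_curr):
    """Return implied by the change in the total return index (RI)."""
    return ri_curr / ri_prev - 1.0


def price_dividend_return(p_prev, p_curr, dividend):
    """Return rebuilt from prices plus the dividend for the month.

    Assumes the dividend is a per-share cash amount on the same
    adjustment basis as the prices (an assumption of this sketch).
    """
    return (p_curr + dividend) / p_prev - 1.0


def checked_return(ri_prev, ri_curr, p_prev, p_curr, dividend,
                   tolerance=0.05, substitute=True):
    """Cross-check the RI return against the price-and-dividend return.

    Returns the RI return when the two agree within `tolerance`. On a
    larger discrepancy, returns the rebuilt return if `substitute` is
    true, otherwise None to signal that the observation is dropped.
    The 0.05 default is a hypothetical placeholder.
    """
    r_ri = ri_return(ri_prev, ri_curr)
    r_pd = price_dividend_return(p_prev, p_curr, dividend)
    if abs(r_ri - r_pd) <= tolerance:
        return r_ri
    return r_pd if substitute else None
```

The `substitute` flag mirrors the two options described in the text: dropping the observation or replacing the index-based return with the recalculated one.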
References

Brooks, Robert D., Robert W. Faff, and Tim R.L. Fry (2001), GARCH Modeling of Individual Stock Data: the Impact of Censoring, Firm Size and Trading Volume, Journal of International Financial Markets, Institutions and Money 11, pp. 215-222.

Clare, Andrew D., and Richard Priestley (1998), Risk Factors in the Malaysian Stock Market, Pacific-Basin Finance Journal 6, pp. 103-114.

Elsas, Ralf (2003), Bank debt vs. public debt of German companies, University of Florida Working Paper.

Griffin, John M. (2002), Are the Fama and French Factors Global or Country Specific?, The Review of Financial Studies 15, pp. 783-803.

Griffin, John M., Susan Ji, and Spencer Martin (2003), Momentum Investing and Business Cycle Risks: Evidence from Pole to Pole, The Journal of Finance, December 2003.

Hiller, David and Andrew Marshall (2002), Insider Trading, Tax-Loss Selling, and the Turn-of-the-year Effect, International Review of Financial Analysis 11, pp. 73-84.

Lau, Sie Ting, Chee Tong Lee, and Thomas H. McInish (2002), Stock Returns and Beta, Firms' Size, E/P, CF/P, Book-to-market, and Sales Growth: Evidence from Singapore and Malaysia, Journal of Multinational Financial Management 12, pp. 207-222.

Naranjo, Andy and R. Burt Porter (2003), International Momentum Strategies: Profitability and Cross-Country Relationships, University of Florida working paper.

Pinfold, John F., William R. Wilson, and Qiuli Li (2001), Book-to-Market and Size as Determinants of Returns in Small Illiquid Markets: the New Zealand Case, Financial Services Review 10, pp. 291-302.

Porter, R. Burt (2003), Market-wide Liquidity Shocks in International Markets, University of Florida working paper.
Table 1 Variable Definitions

This table lists the subset of available Thomson Datastream (TDS) variables examined in this paper. Variable names, mnemonics and descriptions are from TDS.

| Variable Name | Mnemonic | Description |
|---|---|---|
| Mnemonic | MNEM | Unique identification code assigned by Datastream |
| Datastream code | DSCD | Unique six digit identifier for every stock |
| Type of Instrument | TYPE | 'EQ' for equity |
| Name | NAME | The name of the security/company |
| Geographical Grouping | GEOG | Code identifying the home country of the company |
| Exchange Code | EXMNEM | The ISO standard exchange code that identifies the default source of price data |
| Closing Price | P | Closing price adjusted for any subsequent "capital actions" |
| Unadjusted Price | UP | Closing price, unadjusted for dividends or splits |
| Return Index | RI | Change in RI is the total return to holding the stock including capital gains and dividends |
| Market Value | MV | Closing price x number of shares in issue |
| Turnover by Volume | VO | Number of shares in thousands traded on a given day reported by the primary exchange for the stock |
| Local Code | LOC | For U.S. securities this is the CUSIP |
| Dividend | DDE | Dividend rate, adjusted, based upon ex-date |
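The TYPE and NAME fields defined in Table 1 are the inputs to a name-based screen for non-common equity of the kind described in the text. The sketch below only illustrates the idea. The keyword list is hypothetical and deliberately short, it is not the list used in this paper, and a crude scan of this kind will produce both false positives and false negatives.

```python
import re

# Hypothetical keywords hinting at non-common equity; illustrative only.
_NON_COMMON = re.compile(
    r"\b(?:PREF|PREFERRED|PFD|WARRANTS?|REIT|FUND|ETF|SBI|TRUST)\b"
)


def looks_like_common_equity(record):
    """Return True if a record passes the TYPE and NAME screens.

    `record` is a dict keyed by the TDS mnemonics TYPE and NAME.
    Whole-word matching keeps names such as 'TRUSTMARK' from being
    flagged by the 'TRUST' keyword.
    """
    if record.get("TYPE") != "EQ":
        return False
    name = record.get("NAME", "").upper()
    return _NON_COMMON.search(name) is None
```

For example, a record with TYPE 'EQ' and a name ending in 'PFD A' is rejected, while an ordinary company name is kept.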
Table 2 Comparative Statistics

This table lists the number of monthly observations and unique security identifiers available in the 1975-2002 data from the Center for Research in Security Prices (CRSP) and Thomson Datastream (TDS). The identifier is the Permanent Issue Number (permno) for CRSP and the Datastream code (DSCD) for TDS. We download all available CRSP data for the time period and list counts by share code and, for share codes 10 and 11 (common equity), by exchange. We download all available TDS data using TDS constituent lists FAMERA-FAMERZ for currently traded U.S. equities and DEADUS1-DEADUS6 for securities that are no longer traded. We list counts by type and, for type equal to equity, by exchange. Subcategories of unique identifiers do not sum to overall counts because of changes in the value of classification variables in the time series of a unique identifier.

Total number of observations in sample, 1975-2002

| Source | Monthly Obs. | Unique Identifiers |
|---|---|---|
| CRSP | 2,256,605 | 22,832 |
| TDS | 2,048,255 | 21,245 |

CRSP by share code

| Share Code | Share Code Description | Monthly Obs. | Unique Identifiers |
|---|---|---|---|
| 10-11 | Common stock | 1,941,744 | 19,331 |
| 12 | Common, incorporated outside U.S. | 85,233 | 1,141 |
| 13 | Common, americus trust components | 3,196 | 54 |
| 14 | Closed end funds | 66,927 | 664 |
| 15 | Closed end funds, incorp. outside US | 567 | 3 |
| 18 | REITs | 28,277 | 293 |
| 20-24 | Certificates | 1,849 | 18 |
| 30 | ADRs | 65,326 | 764 |
| 40-48 | SBIs (all) | 42,845 | 500 |
| 70-78 | Units (all) | 20,641 | 263 |
| Total | | 2,256,605 | 23,031 |

TDS by type

| TYPE | TYPE Description | Monthly Obs. | Unique Identifiers |
|---|---|---|---|
| missing | | 1,430 | 27 |
| EQ | Equity | 2,002,459 | 20,394 |
| ADR | American Depository Receipt | 21,767 | 382 |
| UT | Unit Trust | 22,599 | 466 |
| Total | | 2,048,255 | 21,269 |

CRSP by exchange (share codes 10 and 11)

| Exchange Code | Exchange Code Description | Monthly Obs. | Unique Identifiers |
|---|---|---|---|
| 1 | NYSE | 503,107 | 3,966 |
| 2 | AMEX | 230,497 | 2,637 |
| 3 | Nasdaq | 1,208,137 | 15,242 |
| 0 | No exchange listed | 3 | 3 |
| Total | | 1,941,744 | 21,848 |

TDS by exchange (type equal to EQ)

| Exchange Code | Exchange Code Description | Monthly Obs. | Unique Identifiers |
|---|---|---|---|
| NYS | NYSE | 946,940 | 7,871 |
| ASE | AMEX | 124,521 | 1,084 |
| NMS | Nasdaq/NMS | 383,283 | 3,665 |
| NAS | Nasdaq/non NMS | 89,115 | 917 |
| OTC | Non-Nasdaq OTC | 211,074 | 3,262 |
| XBQ | OTC Bulletin Board | 235,720 | 3,477 |
| | Other U.S. | 1,705 | 18 |
| | Missing or Unknown | 5,757 | 129 |
| | Non-US | 4,344 | 97 |
| Total | | 2,002,459 | 20,520 |
Table 3 Portfolio Returns

Center for Research in Security Prices (CRSP) portfolios are formed from common equity traded on NYSE/AMEX/Nasdaq. Thomson Datastream (TDS) portfolios are formed from all securities on constituent lists FAMERA-FAMERZ and DEADUS1-DEADUS6 (32 lists total) with type equal to equity and exchange mnemonic of NYSE, AMEX, Nasdaq-NMS and Nasdaq-NonNMS. Screened TDS is TDS screened for non-common equity securities using the procedure described in the body of the paper. All portfolios are equal-weighted except as noted in the table. Size deciles are calculated in December of each year using all NYSE securities. 1090 Momentum refers to the average monthly return of a strategy that is long past winners, defined as the top 10% of stocks sorted by return over months t-2 through t-12, and short past losers. 3070 Momentum is defined similarly, except that winners and losers are the top 30% and bottom 30%. t-statistics are in parentheses.

Monthly returns, 1975-2002

| Portfolio | CRSP Average | CRSP σ | TDS Average | TDS σ | TDS ρ | Screened TDS Average | Screened TDS σ | Screened TDS ρ |
|---|---|---|---|---|---|---|---|---|
| Equal-weighted Market Return | 1.41 | 5.69 | 2.40 | 7.53 | 0.66 | 2.67 | 9.10 | 0.61 |
| Value-weighted Market Return | 1.13 | 4.57 | 1.14 | 4.40 | 1.00 | 1.16 | 4.49 | 1.00 |
| NYSE | 1.35 | 5.00 | 2.00 | 5.35 | 0.80 | 2.24 | 6.54 | 0.74 |
| AMEX | 1.42 | 6.16 | 6.95 | 88.90 | 0.11 | 8.19 | 106.15 | 0.10 |
| NMSNAS | 1.45 | 6.17 | 2.54 | 6.24 | 0.94 | 2.55 | 6.34 | 0.94 |
| Decile 1 (smallest) | 1.60 | 6.44 | 7.15 | 14.69 | 0.34 | 11.27 | 76.62 | 0.12 |
| Decile 2 | 1.32 | 6.06 | 4.53 | 50.30 | 0.12 | 1.83 | 5.98 | 0.93 |
| Decile 3 | 1.40 | 6.11 | 1.53 | 5.17 | 0.91 | 1.63 | 6.05 | 0.95 |
| Decile 4 | 1.39 | 5.92 | 1.39 | 4.97 | 0.94 | 1.50 | 5.84 | 0.96 |
| Decile 5 | 1.39 | 5.75 | 1.38 | 4.98 | 0.95 | 1.41 | 5.65 | 0.97 |
| Decile 6 | 1.28 | 5.39 | 1.29 | 5.18 | 0.96 | 1.28 | 5.49 | 0.97 |
| Decile 7 | 1.27 | 5.22 | 1.29 | 5.18 | 0.96 | 1.38 | 5.45 | 0.97 |
| Decile 8 | 1.23 | 5.10 | 1.28 | 5.06 | 0.96 | 1.33 | 5.10 | 0.98 |
| Decile 9 | 1.18 | 4.74 | 1.27 | 4.88 | 0.97 | 1.28 | 4.91 | 0.98 |
| Decile 10 (largest) | 1.08 | 4.55 | 1.15 | 4.49 | 0.99 | 1.14 | 4.55 | 0.99 |
| 1090 Momentum | 1.13 | 7.13 | 0.26 | 7.99 | 0.67 | 0.20 | 8.79 | 0.64 |
| | (2.86) | | (0.60) | | | (0.42) | | |
| 3070 Momentum | 0.95 | 4.70 | -1.02 | 20.32 | 0.21 | -1.24 | 25.40 | 0.18 |
| | (3.65) | | (-0.90) | | | (-0.88) | | |
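The momentum definition in the Table 3 caption can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the reported numbers. Returns are held in a hypothetical dict mapping a stock identifier to a list of monthly returns, with None for missing months. Stocks are ranked on the compound return over months t-12 through t-2, and winner and loser portfolios are equal-weighted.

```python
def formation_return(returns, t):
    """Compound return over months t-12 through t-2 (month t-1 is skipped)."""
    if t < 12:
        return None
    window = returns[t - 12:t - 1]  # 11 months: indices t-12 .. t-2
    if any(r is None for r in window):
        return None
    growth = 1.0
    for r in window:
        growth *= 1.0 + r
    return growth - 1.0


def momentum_return(panel, t, percent=10):
    """Month-t return of winners minus losers.

    `panel` maps stock id -> list of monthly returns (hypothetical layout).
    `percent` is 10 for the 1090 strategy and 30 for the 3070 strategy.
    """
    ranked = []
    for stock, returns in panel.items():
        if t >= len(returns) or returns[t] is None:
            continue
        past = formation_return(returns, t)
        if past is not None:
            ranked.append((past, stock))
    ranked.sort()
    n = len(ranked) * percent // 100  # stocks in each extreme portfolio
    if n == 0:
        return None
    losers = [panel[s][t] for _, s in ranked[:n]]
    winners = [panel[s][t] for _, s in ranked[-n:]]
    return sum(winners) / n - sum(losers) / n
```

Averaging `momentum_return` over all months t in the sample gives the kind of average monthly strategy return reported in the table.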
Table 4 CRSP/TDS Matching Statistics

This table lists results of attempting to match all December observations from the Center for Research in Security Prices (CRSP) to Thomson Datastream (TDS). The top panel lists matching statistics by CRSP share code and the bottom panel lists the matching statistics by CRSP exchange for CRSP observations with share code equal to common equity.

Full CRSP Sample

| Share code | December Observations | Matching | Non-matching | Fraction of CRSP Matched |
|---|---|---|---|---|
| 10/11 common stock | 179,277 | 108,172 | 71,105 | 60.34% |
| 12 common, incorporated outside US | 8,183 | 5,181 | 3,002 | 63.31% |
| 13 Americus trust | 322 | 30 | 292 | 9.32% |
| 14 closed-end funds | 6,028 | 4,845 | 1,183 | 80.37% |
| 15 closed-end funds, incorp. outside US | 48 | 48 | 0 | 100.00% |
| 18 REITs | 2,572 | 1,874 | 698 | 72.86% |
| 2 Certificates | 164 | 130 | 34 | 79.27% |
| 3 ADRs | 6,034 | 633 | 5,401 | 10.49% |
| 4 SBIs | 3,959 | 2,361 | 1,598 | 59.64% |
| 7 Units | 1,916 | 1,159 | 757 | 60.49% |
| Total | 208,503 | 124,433 | 84,070 | 59.68% |

Common Equity of U.S. Firms

| Exchange | December Observations | Matching | Non-matching | Fraction of CRSP Matched |
|---|---|---|---|---|
| 0 no exchange | 1,671 | 645 | 1,026 | 38.60% |
| 1 NYSE | 44,256 | 30,519 | 13,737 | 68.96% |
| 2 AMEX | 20,506 | 12,897 | 7,609 | 62.89% |
| 3 Nasdaq | 112,601 | 63,999 | 48,602 | 56.84% |
| 10 Boston | 82 | 33 | 49 | 40.24% |
| 13 Chicago | 2 | 0 | 2 | 0.00% |
| 16 Pacific | 30 | 20 | 10 | 66.67% |
| 17 Philadelphia | 8 | 0 | 8 | 0.00% |
| 20 OTC, non-nasdaq | 49 | 22 | 27 | 44.90% |
| other halted or suspended | 72 | 37 | 35 | 51.39% |
| Total | 179,277 | 108,172 | 71,105 | 60.34% |
Table 5 Portfolio Returns

Center for Research in Security Prices (CRSP) portfolios are formed from common equity traded on NYSE/AMEX/Nasdaq with previous month share price greater than or equal to $1.00. Thomson Datastream (TDS) portfolios are formed from all securities on constituent lists FAMERA-FAMERZ and DEADUS1-DEADUS6 (32 lists total) with type equal to equity and exchange mnemonic of NYSE, AMEX, Nasdaq-NMS and Nasdaq-NonNMS, screened for non-common equity securities, having end of previous month unadjusted price greater than or equal to $1.00, and corrected for data errors. All portfolios are equal-weighted except as noted in the table. Size deciles are calculated in December of each year using all NYSE securities. 1090 Momentum refers to the average monthly return of a strategy that is long past winners, defined as the top 10% of stocks sorted by return over months t-2 through t-12, and short past losers. 3070 Momentum is defined similarly, except that winners and losers are the top 30% and bottom 30%. t-statistics are in parentheses.

Monthly returns, 1975-2002

| Portfolio | CRSP Average | CRSP σ | Screened and Corrected TDS Average | Screened and Corrected TDS σ | Screened and Corrected TDS ρ |
|---|---|---|---|---|---|
| Equal-weighted Market Return | 1.29 | 5.46 | 1.51 | 5.16 | 1.00 |
| Value-weighted Market Return | 1.13 | 4.57 | 1.13 | 4.47 | 1.00 |
| NYSE | 1.35 | 4.95 | 1.47 | 4.75 | 0.99 |
| AMEX | 1.29 | 5.77 | 1.36 | 5.21 | 0.97 |
| NMSNAS | 1.28 | 5.86 | 1.66 | 5.91 | 0.99 |
| Decile 1 (smallest) | 1.33 | 5.83 | 2.69 | 5.76 | 0.93 |
| Decile 2 | 1.32 | 6.03 | 1.55 | 5.79 | 0.94 |
| Decile 3 | 1.40 | 6.11 | 1.54 | 5.95 | 0.95 |
| Decile 4 | 1.39 | 5.92 | 1.40 | 5.79 | 0.96 |
| Decile 5 | 1.39 | 5.75 | 1.35 | 5.62 | 0.97 |
| Decile 6 | 1.28 | 5.39 | 1.22 | 5.47 | 0.97 |
| Decile 7 | 1.27 | 5.22 | 1.33 | 5.41 | 0.97 |
| Decile 8 | 1.23 | 5.10 | 1.31 | 5.09 | 0.98 |
| Decile 9 | 1.18 | 4.74 | 1.25 | 4.89 | 0.98 |
| Decile 10 (largest) | 1.08 | 4.55 | 1.12 | 4.54 | 0.99 |
| 1090 Momentum | 1.97 | 6.66 | 1.03 | 6.36 | 0.95 |
| | (5.30) | | (2.92) | | |
| 3070 Momentum | 1.23 | 4.39 | 0.79 | 4.15 | 0.97 |
| | (5.04) | | (3.41) | | |
Table 6 Life Expectancy by Year

This table reports the average life expectancy for all firms in January of each year, reported separately for Center for Research in Security Prices (CRSP) data and the 'screened' data from Thomson Datastream (TDS). CRSP data contains all common equities traded on NYSE/AMEX/Nasdaq. TDS contains all equity traded on NYSE/AMEX/Nasdaq-NMS/Nasdaq-nonNMS screened for non-common equity securities using the method described in the body of the paper. Life expectancy is the average of months remaining for each firm with valid data in January of that year. Number of observations is the number of valid observations in January of each year. Means test is the p-value for a nonparametric Wilcoxon test of the null that the samples have equal mean.

| Year | Max Life | CRSP Avg. Life Expectancy | CRSP Number Obs. | TDS Avg. Life Expectancy | TDS Number Obs. | Difference | Means Test |
|---|---|---|---|---|---|---|---|
| 1975 | 336 | 167.9 | 4,856 | 237.8 | 2,388 | 69.90 | [0.00] |
| 1976 | 324 | 160.5 | 4,862 | 226.9 | 2,440 | 66.46 | [0.00] |
| 1977 | 312 | 153.4 | 4,885 | 215.9 | 2,469 | 62.54 | [0.00] |
| 1978 | 300 | 147.8 | 4,811 | 204.4 | 2,496 | 56.63 | [0.00] |
| 1979 | 288 | 144.4 | 4,728 | 193.3 | 2,533 | 48.99 | [0.00] |
| 1980 | 276 | 139.9 | 4,687 | 182.2 | 2,574 | 42.30 | [0.00] |
| 1981 | 264 | 133.8 | 4,875 | 172.0 | 2,661 | 38.23 | [0.00] |
| 1982 | 252 | 126.1 | 5,216 | 162.8 | 2,772 | 36.74 | [0.00] |
| 1983 | 240 | 121.6 | 5,165 | 153.1 | 2,823 | 31.51 | [0.00] |
| 1984 | 228 | 114.8 | 5,802 | 144.0 | 3,111 | 29.25 | [0.00] |
| 1985 | 216 | 109.6 | 5,904 | 138.8 | 3,150 | 29.16 | [0.00] |
| 1986 | 204 | 107.2 | 5,882 | 133.6 | 3,229 | 26.48 | [0.00] |
| 1987 | 192 | 105.9 | 6,196 | 126.2 | 3,565 | 20.32 | [0.00] |
| 1988 | 180 | 101.0 | 6,429 | 117.3 | 3,832 | 16.25 | [0.00] |
| 1989 | 168 | 98.9 | 6,175 | 109.7 | 3,900 | 10.78 | [0.00] |
| 1990 | 156 | 95.4 | 5,970 | 106.7 | 3,731 | 11.31 | [0.00] |
| 1991 | 144 | 91.2 | 5,810 | 104.4 | 3,606 | 13.17 | [0.00] |
| 1992 | 132 | 86.3 | 5,894 | 100.2 | 3,634 | 13.87 | [0.00] |
| 1993 | 120 | 81.5 | 6,009 | 94.1 | 3,764 | 12.66 | [0.00] |
| 1994 | 108 | 73.8 | 6,548 | 86.1 | 4,141 | 12.33 | [0.00] |
| 1995 | 96 | 66.0 | 6,835 | 78.3 | 4,289 | 12.34 | [0.00] |
| 1996 | 84 | 58.8 | 7,073 | 73.4 | 4,281 | 14.59 | [0.00] |
| 1997 | 72 | 51.0 | 7,524 | 68.2 | 4,417 | 17.20 | [0.00] |
| 1998 | 60 | 43.6 | 7,505 | 57.7 | 4,753 | 14.10 | [0.00] |
| 1999 | 48 | 36.9 | 7,062 | 45.9 | 5,123 | 9.02 | [0.00] |
| 2000 | 36 | 29.7 | 6,713 | 34.1 | 5,635 | 4.44 | [0.00] |
| 2001 | 24 | 21.0 | 6,363 | 22.5 | 5,945 | 1.43 | [0.00] |
| 2002 | 12 | 11.4 | 5,663 | 11.6 | 5,685 | 0.25 | [0.00] |
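The life expectancy measure in Table 6 can be illustrated with a short Python sketch. The data layout and function names are hypothetical. The sketch assumes each firm has valid data in every month between its first and last observation, and it counts the January observation itself as one remaining month, a convention inferred from the Max Life column (336 months in 1975, 12 months in 2002).

```python
def months_remaining(year, last_year, last_month):
    """Months of data from January of `year` through the last valid month, inclusive."""
    return (last_year - year) * 12 + last_month


def average_life_expectancy(firms, year):
    """Average months remaining over firms with valid data in January of `year`.

    `firms` maps id -> ((first_year, first_month), (last_year, last_month)),
    a hypothetical layout that assumes no gaps between first and last month.
    """
    remaining = [
        months_remaining(year, last[0], last[1])
        for first, last in firms.values()
        if first <= (year, 1) <= last
    ]
    if not remaining:
        return None
    return sum(remaining) / len(remaining)
```

A firm that survives to December 2002 contributes the Max Life value for each year in which it is present, so a sample tilted toward survivors shows a higher average.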
Table 7 Size Decile Breakpoints

Center for Research in Security Prices (CRSP) breakpoints are formed from common equity traded on the NYSE. Thomson Datastream (TDS) breakpoints are formed from all securities on constituent lists FAMERA-FAMERZ and DEADUS1-DEADUS6 (32 lists total) with type equal to equity and exchange mnemonic of NYSE. Breakpoints are applied to all securities in the sample without regard to exchange. TDS - Raw refers to the data as originally downloaded; Screened and Corrected refers to the removal of non-common equity and the correction of obvious data errors. Annual Decile Equity Count is the total number of observations in that decile for the full year.

December, 1975

| Decile | CRSP Decile Breakpoint | CRSP Annual Decile Equity Count | TDS - Raw Decile Breakpoint | TDS - Raw Annual Decile Equity Count | Screened and Corrected TDS Decile Breakpoint | Screened and Corrected TDS Annual Decile Equity Count |
|---|---|---|---|---|---|---|
| Decile 1 (smallest) | 16.22 | 27,029 | 2.24 | 3,541 | 2.10 | 3,242 |
| Decile 2 | 25.57 | 5,242 | 5.43 | 3,342 | 5.21 | 2,897 |
| Decile 3 | 39.92 | 4,440 | 10.06 | 3,202 | 9.91 | 3,231 |
| Decile 4 | 60.85 | 4,001 | 17.09 | 2,749 | 17.19 | 3,029 |
| Decile 5 | 92.82 | 3,296 | 27.75 | 3,255 | 28.85 | 2,888 |
| Decile 6 | 151.92 | 2,951 | 46.53 | 2,645 | 51.26 | 2,939 |
| Decile 7 | 248.70 | 2,242 | 75.45 | 2,678 | 87.43 | 2,731 |
| Decile 8 | 461.15 | 2,380 | 176.63 | 2,846 | 203.98 | 3,032 |
| Decile 9 | 815.92 | 1,876 | 516.70 | 2,596 | 563.81 | 2,631 |
| Decile 10 (largest) | | 2,227 | | 2,853 | | 2,748 |
| Total | | 55,684 | | 29,707 | | 29,368 |
| Avg. NYSE Mkt Cap 12/1975 | 443.24 | | 263.49 | | 282.79 | |

December, 2001

| Decile | CRSP Decile Breakpoint | CRSP Annual Decile Equity Count | TDS - Raw Decile Breakpoint | TDS - Raw Annual Decile Equity Count | Screened and Corrected TDS Decile Breakpoint | Screened and Corrected TDS Annual Decile Equity Count |
|---|---|---|---|---|---|---|
| Decile 1 (smallest) | 105.47 | 29,788 | 25.31 | 19,733 | 105.47 | 29,219 |
| Decile 2 | 260.19 | 9,304 | 76.86 | 17,386 | 225.83 | 8,664 |
| Decile 3 | 444.90 | 5,650 | 138.30 | 9,978 | 388.89 | 6,219 |
| Decile 4 | 717.29 | 4,637 | 230.74 | 8,337 | 630.55 | 5,090 |
| Decile 5 | 1,117.65 | 3,751 | 383.05 | 7,514 | 988.23 | 4,233 |
| Decile 6 | 1,663.80 | 2,865 | 665.05 | 7,065 | 1,496.75 | 3,316 |
| Decile 7 | 2,661.05 | 2,463 | 1,212.01 | 6,231 | 2,378.91 | 2,830 |
| Decile 8 | 5,122.23 | 2,462 | 2,366.94 | 4,905 | 4,346.44 | 2,680 |
| Decile 9 | 12,236.69 | 2,223 | 6,254.85 | 4,252 | 10,632.91 | 2,606 |
| Decile 10 (largest) | | 1,931 | | 4,044 | | 2,216 |
| Total | | 65,074 | | 89,445 | | 67,073 |
| Avg. NYSE Mkt Cap 12/2001 | 6,466.97 | | 3,773.45 | | 5,806.00 | |
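The breakpoint procedure in Table 7 computes decile cut points from NYSE securities only and then applies them to every security regardless of exchange. The sketch below illustrates this with Python's standard library. The function names are hypothetical, and the interpolation rule of `statistics.quantiles` (available from Python 3.8) is an assumption that need not match the percentile convention used for the table.

```python
import bisect
import statistics


def nyse_breakpoints(nyse_market_values):
    """Nine decile cut points computed from NYSE market values only."""
    return statistics.quantiles(nyse_market_values, n=10)


def size_decile(market_value, breakpoints):
    """Decile from 1 (smallest) to 10 (largest).

    A value exactly equal to a cut point is placed in the lower decile.
    """
    return bisect.bisect_left(breakpoints, market_value) + 1
```

Because the cut points depend on which securities carry the NYSE exchange code, wrongly tagged securities shift every breakpoint, which is consistent with the gap between the CRSP and TDS - Raw columns.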
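Table 8 Dividends

This table lists summary dividend information for the sample of Center for Research in Security Prices (CRSP) data that we are able to match to Thomson Datastream (TDS) by both firm and date.

CRSP/TDS Matching Sample

| Category | Observations | Percent |
|---|---|---|
| Observations with zero dividends | 781,043 | 85.14% |
| Observations with non-zero dividends | 136,353 | 14.86% |
| Total | 917,396 | 100% |

Observations with non-zero dividends

| Category | Observations | Percent |
|---|---|---|
| Observations with matching non-zero dividend amounts | 127,236 | 93.31% |
| Observations with non-matching dividend amounts | 8,215 | 6.02% |
| Missing TDS Price Data | 902 | 0.66% |

Observations with non-matching dividend amounts

| Category | Observations | Percent |
|---|---|---|
| CRSP>0, TDS=0 | 5,585 | 67.99% |
| CRSP=0, TDS>0 | 1,071 | 13.04% |
| CRSP>0, TDS>0 | 1,559 | 18.98% |
| Total non matching amounts | 8,215 | |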
Figure 1 CRSP sample. [Chart by Year, 1975 to 2000 on the axis. Series: CRSP with match, CRSP with nomatch, Percent Matched. Axes: # of firms (0 to 12000) and Percent Matched (0 to 1).]

Figure 2 TDS sample. [Chart by Year, 1975 to 2000 on the axis. Series: DS with match, DS with nomatch, Percent Matched. Axes: # of firms (0 to 14000) and Percent Matched (0 to 1).]

Figure 3 CRSP vs. TDS Market Dividend Yield, Common Sample. [Chart by Year/Month, 197512 to 200212. Series: CRSP, TDS. Axis: Dividend Yield (0.0% to 7.0%).]

Figure 4 CRSP vs. TDS Market Dividend Yield, All Available Observations. [Chart by Year/Month, 197512 to 200212. Series: CRSP, TDS. Axis: Dividend Yield (0.0% to 7.0%).]