The current issue and full text archive of this journal is available at wwwemeraldinsightcom/0001-253xhtm AP 474 Received 9 July 2008 Revised 21 January 2009 Accepted 6 February 2009 Monitoring web traffic source effectiveness with Google Analytics An experiment with time series Beatriz Plaza Faculty of Economics, University of the Basque Country, Bilbao, Spain Abstract Purpose The aim of this paper is to develop a new user-friendly in-house tracking methodology for academics to analyse the effectiveness of visits (return visit behaviour and length of sessions) depending on their traffic source: direct visits, referring site entries and search engine visits In other words, how deep do visitors navigate into the web site? Which is their internal performance depending on their traffic source? Design/methodology/approach This paper addresses these questions by time series analysis of Google Analytics data Some statistical matters with regard to the use of Google Analytics data in combination with time series methodology are fine-tuned Findings Return visits are the main engine for nurturing session length, but which type of traffic source nurtures these return visits? In order to answer this question, an important distinction must be made between total return visits and marginal return visits Site entries stay longer to the extent their marginal return effectiveness is higher For our particular web site direct visits are the most effective ones, followed by search engine visits and only thirdly link-entries Research limitations/implications This methodology is critical for an effective web site traffic source monitoring and benchmarking that may lead to better web site strategies Originality/value The importance of this paper is not the particular web site but the new methodology tested to arrive at these results, an experiment that could be repeated with different web sites Keywords Internet, User studies, Worldwide web, Systems analysis, Data mining Paper type Research paper Aslib Proceedings: New Information Perspectives Vol 61 No 5, 2009 pp 474-482 q Emerald Group Publishing Limited 0001-253X DOI 101108/00012530910989625 Introduction Hundreds of thousands of academics worldwide own at least a small web site with a log analyser program available to them This log analyser provides plain and simple statistics concerning the web site For instance, the number of visitors, the average number of page views per visitor, average page duration, most requested pages, domain classes and referrers One problem is that these statistics are static In other words, log analysers usually do not provide time series data This lack is critical when examining the effectiveness of a web site The main idea of this paper is that the analysis of the effectiveness of a site s traffic source lies necessarily in the use of time series analysis The purpose of this paper is to set up the methodology, an experiment that could be repeated with different websites This work presents an experiment done with the information that Google Analytics offers for an academic resource about the number of visits on a web site and the traffic
source, which includes organic results in search engines, links from referral web pages or direct access In other words, it investigates the differences between sessions started by direct connection by typing the site name, by a link on another site, or from a search engine with regard to return visit behaviour and length of sessions The importance of this paper is not the particular academic web site wwwscholars-on-bilbaoinfo, but the new methodology tested to arrive at these results, a methodology affordable for non-expert data miners and any academic The case study must be presented only as a way to explain the new methodology because it is a particular case and it is quite possible that in other websites results are completely different regarding the information architecture, the educational goals of the web site and the type of services it delivers How deep do visitors navigate into the web site? Which is their internal performance depending on their traffic source? Are search engine visits more effective than referring site entries? This paper addresses these questions by time series analysis of Google Analytics data, in order to compare the performance of visits depending on their source: direct visits, referring site visits and search engine visits It is understood that to maximize the effectiveness of the entries is synonymous to maximizing the number of pages per visit (ie how deep into the web site they navigate), to maximize the return rate and to minimize bounce rates[1] This is a brief exploratory communication of a work in progress in order to open new paths of research with regard to user-friendly in-house web site tracking for academics, a method that could be interesting for a wider audience Why use Google Analytics? Firstly, and most importantly for the purpose of this study, it is used because Google Analytics provides time series data Moreover, it is also employed because Google Analytics is a free service offered by Google that generates detailed statistics about the visits to a web site, and which is a user friendly application with the guarantee of Google technology This tracking application, external to the web site, records traffic by inserting a small piece of HTML code in every page of the web site Google Analytics tells the web owner how visitors found the site and how they interact with it Users will be able to compare the behaviour of visitors who were referred from search engines and emails, from referring sites and direct visits, and thus gain insight into how to improve the site s content and design Several articles in relation to data mining and web site tracking have been published by Aslib Proceedings in recent years (Fourie and Bothma, 2007; MacFarlane, 2007; Mansourian, 2008; Nicholas et al, 2008) However, the combination of Google Analytics data with time series analysis is a novelty, not only for Aslib Proceedings, but for the whole discipline Monitoring web traffic source effectiveness 475 Profile of the web site In July 2006 a non-profit organization (based in Gernika, Basque Country-Spain) launched http://wwwscholars-on-bilbaoinfo (Art4pax Foundation, 2008) in order to improve the dissemination of R&D results in the field of Art-related Humanities and Social Sciences scientific production[2], through the exchange of research work on the Guggenheim Museum Bilbao case and the city s urban regeneration This locally based web site encompasses academic works that analyse the urban regeneration of the city of Bilbao (among others, strategic plans, infrastructures, the Guggenheim Museum Bilbao and dilemmas, cultural tourism, gentrification, uneven development, creative
AP 476 industries and artists) Each work includes the abstract and a web-link to its pdf/word file Due to the fact that each work is displayed in a single page, the number of pages per visit tells whether the visitors are attracted by the content or not The Google Analytics traffic overview (Figure 1) shows that all traffic sources sent a total of 4,781 visits[3] from 1 February 2007 to 31 December 2008 Of those visits 802 came directly to this site, referring sites sent 2,407 visits via 82 sources, and search engines sent a total of 1,572 visits Reference site traffic is, by far, the main source of entries for wwwscholars-on-bilbaoinfo: 50 per cent of the total incoming visits Therefore, if we were to design a strategic plan to increase traffic, should we focus on link-building? Surprisingly, the answer is no To explain this, we must tackle the problem of how to measure traffic source effectiveness Google Analytics data and time series methodology: preliminary statistical issues Google Analytics allows users to export report data in MS Excel format, which when transformed can be analyzed with time series statistical programs In this case, the software EViews is made use of Before the hypothesis testing, some statistical matters with regard to the use of Google Analytics data in combination with time series methodology must be considered Normal distribution Initially a data set with 4,781 entries for 730 days drawn from Google Analytics was employed to analyse the performance of the web site from 1 February 2007 to 31 December 2008 However, these daily series could not be approximated well by the normal distribution, as showed by the Statistical Bera-Jarque normality tests[4] To solve the problem, daily data were transformed into weekly data series and the Statistical Bera-Jarque normality tests calculated again[5] The new tests show a weekly data set of 99 weeks that can be approximated well by the normal distribution Figure 1 Google Analytics traffic overview for www scholars-on-bilbaoinfo (daily data)
Refining data from bounce entries? The amount of false clicks, or bounce entries, is the number of times visitors immediately exit the web site from the entrance page The data series employed in this exercise are not refined, since this enables delimiting the bounce visits influence on the dependent variables Analyzing bounce visits behaviour is of vital importance for the purpose of this paper Number of entries and sample size It seems that at least 1,000 entries must be requested in order to obtain statistically robust conclusions from the Google Analytics reports However, from the time series analysis perspective, this may not be sufficient In the breakdown for the explanatory variables to be studied, sample size is also important The aim of this paper is to analyze traffic sources (direct entries, referring sites visits and search engine entries) and to compare their performance All traffic sources sent a total of 4,781 entries from 1 February 2007 to 31 December 2008, 802 of which came directly to the web site, referring sites sent 2,407 visits and search engines sent 1,572 total visits Therefore, the sample size for each explanatory variable (direct entries, referring sites visits and search engine entries) seems sufficient to ensure the robustness of the results Monitoring web traffic source effectiveness 477 Stationary time series Empirical work based on time series data assumes that the underlying time series are stationary In fact, invalid regressions can arise if the time series are not stationary Although at this point the precise technical meaning of stationarity is not going to be introduced, loosely speaking a time series is stationary if its mean and variance do not vary systematically over time For this reason tests of stationarity should precede tests of causality Dickey-Fuller stationarity tests are calculated for each variable[6] and almost all show stationary In view of these properties, this study employs autoregressive moving average (ARMA) models applied to time series Hypothesis testing Two questions arise: (1) How deep into the web site do return visitors navigate in comparison with new visitors? Is it vital for the owners of wwwscholars-on-bilbaoinfo to nurture return entries? If so, why? (2) Which type of traffic source (direct, reference or search engine) nurtures return visits? Do search engine visits return more than direct ones, and more than reference site visits? We will start by testing the first question If we were to make use of common sense, we would conclude by saying that return visits navigate deeper into the web site But probably we would not be able to determine to what extent and whether or not new visits spend more time on the site Let s see what happens with the calculations for wwwscholars-on-bilbaoinfo The fitted estimation is as follows Results from Table I show that the number of pages per entry grows by 007 out of every extra return visit, whereas the marginal effect of new visits is nil That is to say that return visits are the main engine for nurturing session length for
AP 478 Table I Testing the influence of new visitors versus return entries on pages per entry Variable Coefficient Std error t-statistic Prob New visits 0012417 0016706 0743225 04592 Return visits 0076581 0032794 2335185 00217 C 4688200 0562589 8333266 00000 DMM5_04_08 7899409 1797436 4394821 00000 DMM11_16_08 6262933 1802976 3473665 00008 DMM8_10_08 5116736 1807417 2830966 00057 R 2 0344413 Mean dependent var 6168283 Adjusted R 2 0309166 SD dependent var 2149174 SE of regression 1786316 Akaike info criterion 4056880 Sum squared resid 2967561 Schwarz criterion 4214160 Log likelihood 21948156 F-statistic 9771503 Durbin-watson stat 2140876 Prob (F-statistic) 0000000 Notes: The tests used are well-adjusted; Breusch-Godfrey serial correlation LM test: F-statistic; 139, probability 020; white heteroskedasticity test: F-statistic 126, probability 027; Jaque-Bera 56, probability 006; dependent variable: pages per visit, method: least squares, date: 01/01/2009, time: 12:51, sample(adjusted): 2/04/2007 12/21/2008, included observations: 99 after adjusting endpoints wwwscholars-on-bilbaoinfo, which makes sense But, which type of traffic source nurtures these return visits? According to the reading of the results in Table II, 036 out of every additional search engine visit returns, 029 out of every extra direct entry visits the site again, and only 021 out of every reference site visit returns; these results can be false It should be kept in mind that data series have not been refined from their bounce visits, which may have biased the estimates Therefore, we can ask the question: How many referring site Variable Coefficient Std error t-statistic Prob Table II Testing the influence of traffic sources on return visits Direct visits 0299393 0076333 3922210 00002 Reference site visits 0217500 0042676 5096494 00000 Search engine visits 0368728 0066064 5581341 00000 C 22842890 1458388 21949338 00543 DMM4_15_07 1531550 3502807 4372351 00000 MA(1) 0310376 0101339 3062768 00029 SMA(3) 0195934 0105686 1853930 00670 R 2 0707674 Mean dependent var 1074747 Adjusted R 2 0688610 SD dependent var 6500147 SE of regression 3627235 Akaike info criterion 5482901 Sum squared resid 1210429 Schwarz criterion 5666394 Log likelihood 22644036 F-statistic 3711960 Durbin-watson stat 1975181 Prob(F-statistic) 0000000 Inverted MA roots 029 þ 050i 029-050i 2031 2058 Notes: The roots of the MA process are outside the unit circle; the tests used are well-adjusted, and are the ones generally used in calculations of this nature; Breusch-Godfrey serial correlation LM test: F-statistic 027, probability 097; white heteroskedasticity test: F-statistic 126, probability 027; Jaque-Bera 009, probability 095; dependent variable: return visits, method: least squares, date: 01/01/ 09 time: 12:25, sample(adjusted): 2/04/2007 12/21/2008, included observations: 99 after adjusting endpoints, convergence achieved after seven iterations, backcast: 1/07/2007 1/28/2007
visits enter and leave the site immediately? The answer is 039 out of every additional entry, according to regression 3 (Table III) This means that, for this particular web site, traffic growth from referring site visits may become problematic due to their high marginal bounce rate The search engine marginal bounce rate is 037 and only 018 out of every additional direct visit is bounce (Table III) To summarize, the reading of the scatter plots (Figures 2 and 3) drives us to the same conclusions as regressions 1, 2 and 3 In principle, the more bounce visits, the less return entries According to the estimates for wwwscholars-on-bilbaoinfo, direct visits show low bounce rates and medium return rates (top left part of the figure) Monitoring web traffic source effectiveness 479 Variable Coefficient Std error t-statistic Prob Direct visits 0189128 0079023 2393312 00187 Reference site visits 0399009 0036214 1101815 00000 Search engine visits 0377280 0056544 6672385 00000 C 2086758 1140682 1829396 00705 DMM4_15_07 21269395 3867878 23281890 00014 R 2 0731237 Mean dependent var 1908080 Adjusted R 2 0719800 SD dependent var 6859704 SE of regression 3631108 Akaike info criterion 5466138 Sum squared resid 1239385 Schwarz criterion 5597204 Log likelihood 22655738 F-statistic 6393766 Durbin-watson stat 1830426 Prob (F-statistic) 0000000 Notes: The tests used are well-adjusted; Breusch-Godfrey serial correlation LM test: F-statistic 059, probability 078; white heteroskedasticity test: F-statistic 047, probability 085; Jaque-Bera 083, probability 065; dependent variable: bounce visits, method: least squares, date: 01/01/09 time: 12:33, sample(adjusted): 2/04/2007 12/21/2008, included observations: 99 after adjusting endpoints Table III Testing the influence of traffic sources bounce visits 05 04 Return visit rate 03 02 01 00 02 03 04 05 06 07 Bounce rate Figure 2 The less bounce rate, the more return visit rate
AP 15 480 Pages per visit 10 5 Figure 3 Return visits navigate deeper into the web site 0 0 10 20 30 40 Return visits (weekly) Link-entries, on the contrary, show high bounce rates and lower return rates (lower right part of the figure) Search engine visits are located at the centre of Figure 2 Final remarks This work presents an experiment done with the time series that Google Analytics offers for a web site A new methodology is developed to analyse the effectiveness of visits depending on their traffic source: direct visits, referring site entries and search engine visits If we were to draw up a strategic plan to expand wwwscholars-on-bilbaoinfo web site s traffic effectiveness, we would first try to promote the number of direct visits Secondly, we would try to improve the web site s search-engines visibility And only thirdly, would we employ time in link-building Why? Because for this particular case, referring site visits are the less productive in terms of extra return visits and visit depth, and the types of visit that bounce the most To summarize, an important distinction must be made between total return visits and marginal return visits Total return visits are the total number of visitors that come back within the given time period Marginal return visits are the additional return visits gained from one extra visit within our given period of time Thus, we might refer to the marginal return visit that the web site gains from its eleventh link-entry of the week And this is precisely one of the main contributions of this study: that the site entries make the web site effective only to the extent their marginal effectiveness is productive The agenda for future research requests the repetition of the experiment with different websites, to delimit more accurately the influence of each traffic source Future studies may also enable us to determine why link-entries return rates are low for wwwscholarson-bilbaoinfo, which requests a more segmented approach to link-entries to delimit each type of referring site, and may shed light on the requested growth rate of link-building in order to nurture its real traffic growth They may also help us to see whether or not the
quality of referring sites links could be improved, all the above aspects of which are critical to complete an in-house web traffic source monitoring and benchmarking Notes 1 Bounce rate is the percentage of visitors who enter a site (or a page) and then leave immediately without visiting any other pages It could also be expressed in terms of time spent on site (eg, users who spend five seconds or less on the site) irrespective of the number of pages they view 2 Listed by wwwelearningeuropainfo/from February 2007, wwwscholars-on-bilbaoinfo was initially set up for the purposes of education and teaching in the frame of ICTs for academic purposes Its use was shown to be limited to the local academic community for the reasons already analyzed by Plaza and Haarich (2008) and Eynon (2008) At present, the majority of visitors enter from foreign countries, an important proportion of which are research centres and academics On some ideas about making research more visible internationally through e-science projects see Schroeder and Fry (2007) On ICTs and structural changes see Dutton et al (2004) 3 As suggested by Talja et al (2007), a high degree of across-fields scattering of relevant literature, in this case contained in wwwscholars-on-bilbaoinfo, may have increased the number of visits of this less established research area 4 The variables under study are the following: Direct visits (DV), reference site visits (RFV), search engine visits (SV), new visits (NV), return visits (RV) and bounce visits (BV) See the following Normality tests for the variables (daily data): Jarque-Bera Test for DV: 492 Prob: 0 Monitoring web traffic source effectiveness 481 Jarque-Bera Test for RFV: 57 Prob: 0 Jarque-Bera Test for SV: 96 Prob: 0 Jarque-Bera Test for NV: 592 Prob: 0 Jarque-Bera Test for RV: 623 Prob: 0 Jarque-Bera Test for BV: 121 Prob: 0 5 Normality tests for the variables (weekly data): Jarque-Bera Test for DV: 73 Prob: 002 Jarque-Bera Test for RFV: 26 Prob: 026 Jarque-Bera Test for SV: 045 Prob: 079 Jarque-Bera Test for NV: 225 Prob: 032 Jarque-Bera Test for RV: 26 Prob: 027 Jarque-Bera Test for BV: 37 Prob: 015 6 Augmented Dickey-Fuller Unit Root Tests: DV ADF Test Statistic: -347 10 per cent Critical Value: 2259 RFV ADF Test Statistic: 2210 10 per cent Critical Value: 2259 SV ADF Test Statistic: 2267 10 per cent Critical Value: 2259 NV ADF Test Statistic: 2227 10 per cent Critical Value: 2259 RV ADF Test Statistic: 2346 10 per cent Critical Value: 2259 BV ADF Test Statistic: 2289 10 per cent Critical Value: 2259
AP 482 If the computed absolute value of the ADF statistic exceeds the 10 per cent critical value, then the time series is stationary In this case, RFV and NV are only weakly stationary To solve this problem, time series were differenced once (ARIMA models were applied) and the results stayed the same Therefore, the author consistently uses ARMA models References Art4pax Foundation (2008), Scholars on Bilbao available at: http://wwwscholars-on-bilbaoinfo Dutton, WH, Kahin, B, O Callaghan, R and Wyckoff, AW (2004), Transforming Enterprise: The Economic and Social Implications of Information Technology, MIT Press, Cambridge, MA Eynon, R (2008), The use of the world wide web in learning and teaching in higher education: reality and rhetoric, Innovations in Education and Teaching International, Vol 45 No 1, pp 15-23 Fourie, I and Bothma, T (2007), Information seeking: an overview of web tracking and the criteria for tracking software, Aslib Proceedings, Vol 59 No 3, pp 264-84 MacFarlane, A (2007), Evaluation of web search for the information practitioner, Aslib Proceedings, Vol 59 Nos 4/5, pp 352-66 Mansourian, Y (2008), Web search efficacy: definition and implementation, Aslib Proceedings, Vol 60 No 4, pp 349-63 Nicholas, D, Rowlands, I, Clark, D, Huntington, P, Jamali, HR and Ollé, CUK (2008), UK scholarly e-book usage: a landmark survey, Aslib Proceedings, Vol 60 No 4, pp 311-34 Schroeder, R and Fry, J (2007), Social science approaches to e-science: framing an agenda, Journal of Computer-Mediated Communication, Vol 12 No 2, pp 563-82 Talja, S, Vakkari, P, Fry, J and Wouters, P (2007), Impact of research cultures on the use of digital library resources, Journal of the American Society for Information Science and Technology, Vol 58 No 11, pp 1674-85 Further reading Plaza, B and Haarich, SN (2009), The Guggenheim Museum Bilbao in scientific journals: asymmetries between the American art perspective and the European regional planning viewpoint, MPRA Paper No 10751, available at: http://mpraubuni-muenchende/10751/ 1/MPRA_paper_10751pdf (accessed 1 February 2009) Corresponding author Beatriz Plaza can be contacted at: beatrizplaza@ehues To purchase reprints of this article please e-mail: reprints@emeraldinsightcom Or visit our web site for further details: wwwemeraldinsightcom/reprints