An Examination of the Long-Tail Hypothesis in Online News Market: The Case of Google News


Wesleyan University
The Honors College

An Examination of the Long-Tail Hypothesis in Online News Market: The Case of Google News

by Hengyi Zhu
Class of 2015

A thesis submitted to the faculty of Wesleyan University in partial fulfillment of the requirements for the Degree of Bachelor of Arts with Departmental Honors in Mathematics-Economics

Middletown, Connecticut
April, 2015

Table of Contents

Acknowledgements
Abstract
Chapter 1. Introduction
  1.1 News Industry and News Aggregators
  1.2 The Long-tail
  1.3 This Thesis
Chapter 2. Literature Review
Chapter 3. Applying Long Tail Hypothesis to the Online News Market
  3.1 Data
  3.2 Summary Statistics
  3.3 Empirical Models and Results
Chapter 4. Long Tail Hypothesis at the Company Level
  4.1 Data
  4.2 Summary Statistics
  4.3 Empirical Models and Results
Chapter 5. Conclusion
References

Acknowledgements

I thank my thesis advisor, Professor Christiaan Hogendorn, who mentored me for two summers and through my senior year. Without the opportunities you provided and your patience as well as encouragement, I cannot see myself starting and completing this thesis. I thank my MECO advisor, Professor Gilbert Skillman. Not only did your advice help me through my studies at Wesleyan and beyond, your classes enlightened me so much that I shall be forever grateful. I thank Manolis Kaparakis, the director of the Quantitative Analysis Center, for your earnest support and devotion to teaching. Your presence is such an important part of my Wesleyan experience. I thank Professor David Constantine from the Mathematics department. Your rigorous and intellectually stimulating teaching style helped me wade through many mathematical challenges. I take this opportunity to express gratitude to the entire faculty in both the Mathematics and Economics departments. I have learned and grown so much in my four years at Wesleyan with your help. Thanks to my thesis writing tutor, Kerry Nix. Without your help, time, and patience, this thesis would not be the same as it is today. Thanks to my friends and family. You are always of paramount importance to me. Your support and love have kept me thriving. I love you all.

Abstract

The emergence of digital technologies has transformed the news industry, and news aggregators have become the most popular news destinations online. This thesis analyzes how online news aggregators affect the online news distribution over time. Specifically, it examines the distribution of the link frequency for each domain appearing on Google News over different periods of time and tests the long tail hypothesis, which states that the tail of the link frequency distribution should be getting longer and fatter over the years. Since most major news websites are now owned by a small group of companies, I incorporate ownership information into this analysis. I found that although more and more small and niche news websites are getting linked from Google News, each receives links only a limited number of times. The long tail hypothesis is not fully supported at the domain level; over time, the tail is only lengthened, not fattened. Moreover, domain characteristics affect a domain's link frequency. Analyzing the link frequency distribution at the owner level, I found that with ownership aggregation the tail becomes even thinner.

Chapter 1. Introduction

1.1 News Industry and News Aggregators

The emergence of digital technologies has transformed the news industry. As shown in Figure 1.1, more and more Americans are getting news online, and newspapers and television are declining in popularity as news sources (Pew Research Center, 2012).

[Figure 1.1: Where Americans Get News (Pew Research Center, 2012)]

Within the realm of online news, news aggregators are news sites that do not produce much original content, but rather curate content created by others using a combination of human editorial judgment and computer algorithms. On a typical news aggregator page, each news entry is presented with a title, a brief description, the name of the original content creator, and
perhaps photographs from the original article; to access the full article, users may click through the portal and go to the website of the original content creator. As shown in Figure 1.2, news aggregators have become so popular that more than 50% of people identify a news aggregator as their top source of news (Pew Research Center, 2012).[1] The three most popular news websites in January 2015 were all news aggregators, leading by a large margin (ebiz, 2015).

[1] Figures add to more than 100% because of multiple responses.

[Figure 1.2: Where do People Get News Online (Pew Research Center, 2012)]

News aggregators can be divided into a few different types, despite all serving the same purpose of news aggregation (Athey & Mobius, 2012). Pure aggregators, such as Google News, generally do not make any payments or have any formal relationship with the original content providers; instead, they
create pages by crawling[2] the web and using algorithms as well as editorial judgments to organize the content. There are only a few rare cases in which Google News has a direct commercial relationship with the news source. For example, Google News had a relationship with the Associated Press, during which Google primarily showed content from the Associated Press, as analyzed by Chiou and Tucker (2011). In contrast, sites like Yahoo! News and MSN primarily present content from contractual partners. Sites like the Huffington Post use a hybrid strategy of curating blogs and aggregating news from other sources.

[2] Web crawlers are software that discover publicly available webpages. Crawlers look at webpages and follow links on those pages, much like you would if you were browsing content on the web. The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As crawlers visit these websites, they look for links to other pages to visit.

Opinions on news aggregators vary widely. One side accuses news aggregators of stealing traffic from news sites. Rupert Murdoch, owner of News Corp. and The Wall Street Journal, stated, "The people who simply just pick up everything and run with it steal our stories, we say they steal our stories they just take them. That's Google, that's Microsoft, that's Ask.com, a whole lot of people... they shouldn't have had it free all the time, and I think we've been asleep." Meanwhile, news aggregators contend that they actually drive traffic to news sites. In Google's comment on the FTC's 2010 discussion draft, a Google spokesperson argues that "Google makes it easy for users to find the
news they are looking for and to discover new sources of information... We send more than four billion clicks each month to news publishers."[3]

[3] FED. TRADE COMM'N, FEDERAL TRADE COMMISSION STAFF DISCUSSION DRAFT: POTENTIAL POLICY RECOMMENDATIONS TO SUPPORT THE REINVENTION OF JOURNALISM (2010), http://www.ftc.gov/opp/workshops/news/jun15/docs/new-staffdiscussion.pdf (hereinafter DISCUSSION DRAFT).

Since revenue for newspapers has been diminishing due to a decline in print subscriptions and advertising revenue, some newspapers have implemented paywalls on their websites, which prevent Internet users from accessing webpage content without a paid subscription, in order to increase their revenue. If you are browsing a news website to which you are not subscribed, you might not be able to read the full stories of your choice. However, despite the debate I mentioned earlier, Google News has worked with most subscription-based news services to ensure that the first article seen by a Google News user does not require a subscription. Although this first article can be seen without subscribing, any further clicks on the article page will prompt the user to log in or subscribe to the news site.[4]

[4] If First Click Free isn't a feasible option for the news website, Google will display the "subscription" label next to the publication name of all sources that greet its users with a subscription or registration form. https://support.google.com/news/publisher/answer/40543?hl=en

1.2 The Long-tail

In an earlier era, the "blockbuster" strategy and the "winner-take-all society" were prominent features (Frank & Cook, 1995). Those strategies favor an application of the Pareto principle, which dictates that 80% of the total revenue is generated by about 20% of the total product line (Koch, 2001).

Such strategies lead to economies driven by hits. Alternatively, a "long tail"[5] view has been trending over recent years. According to this view, total sales revenues of products in the tail, which online retail space makes more easily accessible, are worth more and more, and approach the sales revenues of the hits. An article named "The Long Tail" was published in Wired in October 2004, and its author Chris Anderson later turned it into a New York Times bestselling book, The Long Tail: Why the Future of Business is Selling Less of More, in 2006. Three main observations led Anderson to the idea of the long tail: (1) the tail of available variety is far longer than we realize; (2) it's now within reach economically; (3) all those niches, when aggregated, can make up a significant market (Anderson, 2006, p. 10). Amazon.com serves as an example: about 30% of its total sales come from products not available in the largest offline retail stores. Amazon has successfully sold enough of the non-hits to establish a marketplace that has not been explored before.

[5] The tail refers to the tail of a quantity versus rank plot. Sample graphs are in the following pages.

According to Anderson, at least three forces are behind the aforementioned phenomena: democratizing the tools of production, cutting the costs of consumption by democratizing distribution, and connecting supply and demand (Anderson, 2006, pp. 52-57). The first force, democratizing the tools of production, entails bringing in more producers and therefore products, lengthening the tail
(Figure 1.3). Improved digital technology enables individuals to do what until just a few years ago only professionals could. Millions of people now have the capacity to make short films or albums, or publish their thoughts to the world. For instance, in the music industry, the number of new albums released grew a phenomenal 36 percent in 2005 to 60,000 titles (up from 44,000 in 2004), largely due to the ease with which artists can record and release their own music. With the available universe of content growing faster than ever, the tail extends rightward.

[Figure 1.3: Democratize the Tools of Production (Anderson, 2006, p. 54)]

The second force, democratizing distribution, cuts the costs of consumption and fattens the tail (Figure 1.4). The Internet makes it cheaper to reach more people. Aggregators such as Amazon, eBay, iTunes, and Netflix provide cheap and easy access to the content being produced to users who might not have access to those goods through traditional distribution channels. With consumers' better access to niches, the tail fattens.

[Figure 1.4: Democratize the Tools of Distribution (Anderson, 2006, p. 55)]

The third force, connecting supply and demand, introduces consumers to these newly available goods (Figure 1.5). Connecting supply and demand can take the form of anything from Google's wisdom-of-crowds search, iTunes recommendations, and word-of-mouth to blogs and customer reviews. As a result, consumers experience lowered search costs[6] of finding niche content; thus, demand is driven down the tail.

[6] Search costs refer to anything that gets in the way of finding what you want. Some are monetary while some are not. Nonmonetary search costs include wasted time or hassle in consumption.

[Figure 1.5: Connect Supply and Demand (Anderson, 2006, p. 56)]

In economic terms, the three forces of the long tail, which traditional firms do not possess because of the constraints of physical products and limited shelf space, allow Internet firms and e-commerce stores to cut production costs, distribution costs, and search costs so as to bundle a huge inventory of hits and niches. With the help of information technologies, the forces that underlie such long tails have been harnessed for competitive advantage (Huang & Wang, 2014). When news media went online, they achieved new efficiencies in manufacturing, distribution, and connecting supply and demand. The unique capacities of the Internet provide a foundation for a possible long tail economy for online news. The Internet contributes to the long tail economy for online news by lowering production costs, distribution costs, and search costs, parallel to the three forces of the long tail.

1.3 This Thesis

The news aggregator of interest for this thesis is Google News. As a pure aggregator, Google News crawls the web and uses algorithms to organize its content. Google explains its algorithms in its patent document.[7] The major factors considered by its ranking algorithm include: volume of production from a news source, length of articles, the importance of coverage by the news source, the Breaking News Score, human opinion of the news source, audience and traffic, staff size, number of news bureaus, the "breadth" of the news source, the global reach of the news source, and writing style. This enumeration, especially the first few criteria, shows that Google intends to favor large legacy media over smaller or niche news websites. However, based on the forces for the long tail hypothesis, there is the possibility that the long tail hypothesis can also be applied to the online news market. In other words, the small and lesser-known news sites may be benefiting more from Google News over the years. To address this question, this thesis uses data of Google News content since its launch in 2002, analyzes the distribution of the link frequency for each domain over different periods of time, and tests the long tail hypothesis that the tail of the link frequency distribution is getting longer and fatter over the years.

[7] Systems and methods for improving the ranking of news articles, US 20120158711 A1, http://www.google.com/patents/us20120158711

Different sections, such as top stories, U.S., World, Sports, Business, Technology, and Entertainment, make up a typical Google News page. A typical Google News page during my period of study is shown in Figure 1.6. A user is directly presented with the top story section and part of the sidebar on the right, called the small section. The user needs to scroll down the page in order to view the other sections. A domain's frequent appearance in one of the sections indicates an intrinsic characteristic of the domain. For example, if a domain appears frequently in the Entertainment section, this domain most likely concentrates on covering entertainment news. I will control for those domain characteristics in the tests of the link frequency distribution.

[Figure 1.6: Sample Google News page (May 3, 2012), retrieved from Archive.org]

Most major news websites are now owned by a small group of companies. In the highly competitive media industry, consolidation with the ensuing economies of scale is widely seen as a necessary condition for
survival (DellaVigna & Hermle, 2014). Numerous mergers have left the news industry dominated by large companies, producing an industry in which the major players are highly integrated. The Columbia Journalism Review website features a dataset of major media companies and their subsidiaries. If a news website from my Google News data is listed there, it belongs to one of those major media companies. I will also analyze how this ownership information affects my tests of the long tail hypothesis.

Chapter 2 of this thesis is a literature review. Chapter 3 analyzes the long-tail hypothesis at the domain level, controlling for domain characteristics. Chapter 4 incorporates the ownership information and examines the long tail at the company level. Chapter 5 concludes.

Chapter 2. Literature Review

Many papers have been written that either apply the long tail hypothesis to different industries, especially industries with improved digital technology, or test the validity of the forces behind the long tail hypothesis. Elberse and Oberholzer-Gee (2007) study the distribution of sales in the U.S. home video industry for the 2000 to 2005 period, and find a long tail effect; the number of titles that sell only a few copies every week increases almost twofold. At the top end of the distribution, most hits draw smaller
audiences. At the tail end, they find that there is a rapidly increasing number of titles that never, or very rarely, sell: the long tail appears incredibly flat.

Brynjolfsson, Hu and Smith (2010) analyze the change in shape of Amazon's sales distribution curve from 2000 to 2008, and how the change impacts the resulting consumer surplus gains from increased product variety in the online book market. They find that the long tail has grown longer over time, with niche books accounting for a larger share of total sales. Their analyses suggest that by 2008, niche books accounted for 36.7% of Amazon's sales, and the consumer surplus generated by niche books increased at least fivefold between 2000 and 2008. The increase in consumer surplus suggests that Amazon's long tail is likely to be a permanent shift instead of a short-lived phenomenon. Also, while previous research has assumed a constant slope between the log of sales and the log of sales rank, they find that the sales of a book drop at a faster rate than a log-linear curve indicates and that the slope becomes steeper as a book's sales rank increases, suggesting that there may be forces that limit Amazon's ability to sell books that are extremely niche.

Brynjolfsson, Hu and Simester (2011) examine the forces behind the long tail phenomenon. They first use data collected from a multichannel retailer and present empirical evidence that the Internet channel exhibits a significantly less concentrated sales distribution when compared with traditional channels. Then, they control for the differences in product availability between channels, and show that consumers' usage of Internet search and discovery tools, such as recommendation engines, is associated
with an increased share of niche products. They conclude that the Internet's long tail is not solely due to the increase in product selection but also partly a reflection of lower search costs on the Internet. Their research validates the first[8] and third[9] forces of the long tail hypothesis introduced in Chapter 1.

[8] Democratizing the tools of production.
[9] Connecting supply and demand.

Peltier and Moreau (2012) use a database of monthly sales of comic books and literature books in France over the period 2003 to 2007, and show that, firstly, bestsellers got smaller market shares online than offline, contrary to medium- and low-sellers. Secondly, both online and offline sales shift from the head of the distribution to the tail with increasing magnitude over the period. Thirdly, the long tail appears to be more than just a short-lived phenomenon caused by the specific preferences of early adopters of e-commerce. These three results suggest that online information and distribution tools, whose use increased over the period 2003 to 2007, do have an impact on book distribution and on consumers' purchase decisions. While online sales accounted for only 4% of overall sales in 2007, according to their data, those sales are experiencing strong growth that the advent of the digital book will reinforce.

Bourreau, Gensollen, Moreau and Waelbroeck (2013) use data from a survey of 151 French record companies conducted in 2006 to test the long tail hypothesis at the level of the firm. Specifically, they test whether record companies that have adapted to digitization at various levels (artists,
scouting, distribution, and promotion) release more new albums without having higher overall sales. They consider two types of output: a commercial output (album sales) and a creative output (number of new albums released). Their results suggest that adaptation to digitization had a strong and positive impact on the production of new albums (the creative output), but no effect on sales (the commercial output). Their result is in line with the long tail hypothesis in the sense that they are "selling less of more"; digitization enhanced the creativity of record companies, leading digitized music labels to release more new albums, but this did not result in higher sales for those labels.

Huang and Wang (2014), using surveys, third-party traffic metrics, and content analysis, found that the traffic performance of online news sites was significantly impacted by long tail forces, but the impact had not transferred to the news sites' financial performance. Online news institutions have responded to the changing market trend of segmentation and niches by deploying a long tail economy in terms of content, service, and participation variety through the aid of information and Internet technologies. However, despite carrying out the long tail model, online news institutions are still encountering the difficulty of turning web traffic into real profit and revenue.

As presented above, applying the long tail hypothesis to different industries yields mixed results. The notion of the long-tail hypothesis also differs slightly for different researchers; a tail can be lengthened or shortened, fattened or flattened. For this thesis, a longer tail requires the tail being both
lengthened and fattened.

Although there are no studies focusing specifically on Google News and the long tail phenomenon, many have studied Google News and aggregation. These studies provide insights on Google News as a news aggregator. Athey and Mobius (2012) analyze the impact of news aggregators on the quantity and composition of internet news consumption. They perform a case analysis of an example in which Google News added local content to its news home page for users who chose to enter their location. Using a dataset of user browsing behavior, they compare users who adopt the localization feature, which includes adding a Local news section, to a sample of control users. They find that users who adopt the localization feature subsequently increase their usage of Google News, which in turn leads to additional consumption of local news. This result suggests that the inclusion of local content by Google News had mixed effects on local outlets: it increased their traffic, especially in the short run, but it also increased the reliance of users on Google News as a source of news, and increased the dispersion of user attention across outlets. In other words, more users go to Google News instead of visiting news websites directly; they also get redirected to news websites that they would not have otherwise visited.

Jeon and Esfahani (2012), in one of the few theoretical papers in this field, consider how news aggregators affect the quality choices of newspapers competing on the Internet. To provide a micro-foundation for the role of the aggregator, they build a model of multiple issues in which newspapers choose
their quality on each issue. The model captures both business-stealing and readership-expansion effects of the aggregator. They find that the presence of the aggregator leads newspapers to specialize their news coverage, and changes quality choices from strategic substitutes to strategic complements. The aggregator is beneficial for consumers, whereas it may harm newspapers. However, even if the aggregator harms newspapers, each newspaper may prefer to keep its link with the aggregator.

Chapter 3. Applying Long Tail Hypothesis to the Online News Market

3.1 Data

Archive.org, also called the Wayback Machine, is a digital archive of the World Wide Web and other information on the Internet created by the Internet Archive, a non-profit organization based in San Francisco, California. Creators Brewster Kahle and Bruce Gilliat originated the Internet Archive Wayback Machine in 1996. It was officially launched in 2001 and is maintained with content from Alexa Internet. The service enables users to see archived versions of web pages across time. For instance, a user can search and view an archived webpage as it appeared on Feb 15, 2004 and as it appeared the following day.

Kahle and Gilliat founded Alexa Internet in 1996. The name Alexa was chosen to pay homage to the Library of Alexandria, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. Alexa's operation includes archiving of webpages as they are crawled. This database served as the basis for the creation of the Internet Archive accessible through the Wayback Machine. Aside from web crawling, Alexa collects data on browsing behavior from those who have the Alexa Toolbar installed and transmits it to the Alexa website, where it is stored and analyzed, and forms the basis for the company's web traffic reporting. Amazon acquired Alexa in 1999 for approximately 250 million U.S. dollars in Amazon stock. Currently, Alexa is a purely analytics-focused company that competes with other web analytics services, such as Compete.com and Quantcast.

According to Archive.org's webpage, most of the Wayback Machine's archived web data comes from its own crawls or from Alexa Internet's crawls. Both of those automated crawls tend to find sites that are well linked from other sites. Besides that, some sites are harder to archive than others, for the following reasons. Firstly, Archive.org respects robot exclusion headers.[10] Secondly, JavaScript elements are often hard to archive. Thirdly, if a website requires the crawler to contact the originating server in order to work, it will fail when archived. Moreover, the archive contains crawls of the Web completed by Alexa Internet; if Alexa does not know about a site, it will not be
archived. Finally, if there are no links to a website, the robot will not find the site.

[10] One can exclude one's website from being crawled by including a robot exclusion header.

In 2006, the Internet Archive launched Archive-It, a subscription service that allows institutions to build and preserve collections of digital content. Archive-It partners can harvest, catalog, manage, and browse their archived collections. All data created using the Archive-It service is hosted and stored by the Internet Archive. Archive-It is very flexible; one can harvest material from the Web using ten different frequencies ranging from daily to annually. Partners develop their own collections and have complete control over which content to archive within those collections. Both Archive-It and Archive.org serve to archive the Internet, but use different methods. Archive.org archives the Internet through its automated crawls, while Archive-It allows the owners of websites to decide how they want their websites archived.

In October 2013, a save page feature for the Wayback Machine was launched so that every user can archive pages on demand. Web pages archived by this feature will be available almost immediately after the user clicks the save page button on archive.org/web, provided the site allows crawlers.[11]

[11] As mentioned earlier, the presence of a robot exclusion header will prevent a page from being crawled.

Once a page is saved, one cannot differentiate whether it was archived by automated crawls or through the save page feature. Automated crawls, Archive-It, and the save page feature are the three main methods
the Internet Archive uses to preserve the Internet, though it predominantly uses the first method. As of March 2015, 456 billion web pages have been saved. The Archive's goal is to index the whole World Wide Web without any judgments about which pages are worth saving.

The potential importance of the Archive for longitudinal and historical Web research leads to the need to evaluate the biases of its coverage. Thelwall and Vaughan (2004) found that there is significant bias in terms of both rates of inclusion in the Archive and length of time of inclusion by country. However, the Internet Archive is naturally biased by link structures rather than by countries: historical factors have caused the first to map onto the second. Therefore, it is reasonable to believe that there will also be intra-national and other biases that are related to site age and link structures. Caution must be advised in interpreting findings of such studies, unless methods can be devised to bypass these problems.

Among those billions of archived web pages, about 14 thousand are archives of Google News (news.google.com). Google News was launched in 2002. The Wayback Machine therefore covers the entire history of Google News. Figure 3.1 presents the frequency of scrapes of Google News archived from September 2002 to the end of 2013. Before May 2004, Google News pages were rarely saved, usually fewer than 5 times per month. In June 2004, the frequency suddenly picked up: pages were archived for more than half of the days in each month, and there were sometimes even multiple pages per day. This pattern continued into late 2006, and then the frequency dropped
again. During 2007 and 2008, the number of pages archived per month varied greatly, but overall there was a decrease in the total amount archived. Then, there was a recovery in 2009: Google News was saved almost every day for the entire year as well as the first half of 2010. Later, there was a slight dip in the second half of 2010 that continued into 2011. However, starting in mid-2011, multiple pages per day were saved and the frequency kept getting higher; this pattern persists today, with about 10 to 20 pages archived every day recently, and the number seems to keep growing.

[Figure 3.1: Frequency of scrapes of Google News archived on Archive.org from September 2002 to the end of 2013]

Since a website will be crawled more frequently if it is well linked from other sites, the frequency of the Google News archives can be a proxy for how well linked Google News was over the years, and for its popularity. In June 2011, there was a major redesign of Google News resulting in a greater number of stories per page; the timing of the redesign accords with the time of increased archiving frequency.

For the analysis of this chapter, I will use a dataset that I created from saved pages available at Archive.org. I was able to parse the HTML code of those archived Google News pages from 2002 to 2013 and collect the desired information with a Python package called Beautiful Soup. By analyzing how the HTML code is structured and navigating through the parse trees generated by Beautiful Soup, I located and parsed out information such as date, time, section on the page, position under the section, title, URL link, and so on, for each news entry that appeared on Google News. Google News went through many design changes, both major and minor. Due to those changes, I wrote about 20 scraping scripts; taken altogether, they scraped all of the thousands of pages archived since 2002. The total number of observations for this data set is greater than one million. Since this data set generated from Archive.org is concerned with only the archives of one website, Google News, and the archives were made entirely within the U.S., the previously discussed biases based on rate of inclusion and length of inclusion of the country in the archive do not apply.
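
To make the scraping step concrete, the sketch below shows how one archived snapshot might be parsed. It is a minimal illustration, not one of the actual scripts: the tag and class names ("section", "story") are hypothetical stand-ins, since the real markup differed across the Google News redesigns, which is why roughly 20 layout-specific scripts were needed.

# A minimal sketch of parsing one archived Google News snapshot with
# Beautiful Soup. The CSS class names are hypothetical stand-ins; each
# Google News redesign would need its own matching selectors.
import urllib.request
from bs4 import BeautifulSoup

def parse_snapshot(url):
    """Extract one record per news entry from a Wayback Machine snapshot."""
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for section in soup.find_all("div", class_="section"):  # hypothetical class
        heading = section.find("h2")
        section_name = heading.get_text(strip=True) if heading else "unknown"
        for position, story in enumerate(
                section.find_all("div", class_="story"), start=1):  # hypothetical
            link = story.find("a")
            if link is None or not link.get("href"):
                continue
            records.append({
                "section": section_name,   # e.g., Top Stories, Health
                "position": position,      # position under the section
                "title": link.get_text(strip=True),
                "url": link["href"],       # full link to the original story
            })
    return records

The date and time fields described above would be read from the snapshot URL itself, and the domain name would be extracted from each story URL.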

Because Google News pages were archived at an inconsistent frequency, the number of unique days per month with scrapes available varies, as shown in Table 3.1. To minimize the bias created by this uneven spread of data across time, I designed and implemented the following sample selection method: first, I divide my data into six-month periods and include in this data a maximum of one page per day. For consistency, I chose the page closest to 4pm for each available day.[12] Then, I keep only the periods for which each month inside that period has more than 10 unique days with scrapes available. After that, I randomly select 10 scrapes from each month and combine 12 months into one period, resulting in a total of 7 periods, as shown in Table 3.1. Therefore, in each period there is an identical number of scrapes; more specifically, there are 120 scrapes per period. By doing so, I lose a large number of observations, but I am able to make my data more evenly spread out across time. Those periods, ranging from 2004 to 2013, become time dummies in my analysis. Therefore, the gaps in between will not affect my analysis.

[12] I chose 4pm since it is usually considered a prime news viewing time.

Table 3.1: Number of unique days per month with scrapes available
(Days = number of unique days with scrapes; "/" = month not assigned to any period)

  Days  Year-Month  Period        Days  Year-Month  Period
     0  2002-07     /                1  2003-01     /
     0  2002-08     /                1  2003-02     /
     3  2002-09     /                2  2003-03     /
    10  2002-10     /                1  2003-04     /
    24  2002-11     /                1  2003-05     /
     1  2002-12     /                3  2003-06     /
     1  2003-07     /                1  2004-01     /
     0  2003-08     /                1  2004-02     /
     1  2003-09     /                1  2004-03     /
     2  2003-10     /                0  2004-04     /
     1  2003-11     /                3  2004-05     /
     1  2003-12     /               18  2004-06     /
    22  2004-07     Period 1        30  2005-01     Period 1
    20  2004-08     Period 1        25  2005-02     Period 1
    27  2004-09     Period 1        29  2005-03     Period 1
    21  2004-10     Period 1        26  2005-04     Period 1
    28  2004-11     Period 1        25  2005-05     Period 1
    18  2004-12     Period 1        29  2005-06     Period 1
    28  2005-07     Period 2        22  2006-01     Period 2
    29  2005-08     Period 2        16  2006-02     Period 2
    16  2005-09     Period 2        13  2006-03     Period 2
    11  2005-10     Period 2        22  2006-04     Period 2
    17  2005-11     Period 2        11  2006-05     Period 2
    29  2005-12     Period 2        17  2006-06     Period 2
    15  2006-07     /                4  2007-01     /
    22  2006-08     /                5  2007-02     /
    13  2006-09     /                6  2007-03     /
    14  2006-10     /                8  2007-04     /
    17  2006-11     /               14  2007-05     /
     8  2006-12     /               11  2007-06     /
     3  2007-07     /                0  2008-01     /
     3  2007-08     /               14  2008-02     /
     4  2007-09     /                6  2008-03     /
    18  2007-10     /                5  2008-04     /
     3  2007-11     /               16  2008-05     /
     0  2007-12     /                6  2008-06     /
     6  2008-07     /               28  2009-01     Period 3
     3  2008-08     /               28  2009-02     Period 3
     7  2008-09     /               23  2009-03     Period 3
     6  2008-10     /               30  2009-04     Period 3
     3  2008-11     /               31  2009-05     Period 3
    17  2008-12     /               30  2009-06     Period 3
    31  2009-07     Period 3        31  2010-01     Period 4
    31  2009-08     Period 3        28  2010-02     Period 4
    24  2009-09     Period 3        31  2010-03     Period 4
    28  2009-10     Period 3        30  2010-04     Period 4
    30  2009-11     Period 3        31  2010-05     Period 4
    31  2009-12     Period 3        30  2010-06     Period 4
    23  2010-07     Period 4        19  2011-01     Period 5
    14  2010-08     Period 4        20  2011-02     Period 5
    10  2010-09     Period 4        18  2011-03     Period 5
    19  2010-10     Period 4        17  2011-04     Period 5
    16  2010-11     Period 4        13  2011-05     Period 5
    15  2010-12     Period 4        16  2011-06     Period 5
    25  2011-07     Period 5        31  2012-01     Period 6
    30  2011-08     Period 5        29  2012-02     Period 6
    29  2011-09     Period 5        31  2012-03     Period 6
    31  2011-10     Period 5        29  2012-04     Period 6
    30  2011-11     Period 5        31  2012-05     Period 6
    31  2011-12     Period 5        30  2012-06     Period 6
    31  2012-07     Period 6        31  2013-01     Period 7
    31  2012-08     Period 6        28  2013-02     Period 7
    30  2012-09     Period 6        31  2013-03     Period 7
    31  2012-10     Period 6        29  2013-04     Period 7
    30  2012-11     Period 6        31  2013-05     Period 7
    31  2012-12     Period 6        30  2013-06     Period 7
    31  2013-07     Period 7
    31  2013-08     Period 7
    30  2013-09     Period 7
    31  2013-10     Period 7
    30  2013-11     Period 7
    30  2013-12     Period 7

The following Table 3.2 shows a sample line of data (presented vertically here due to spatial constraints). The first item is a timestamp. The second item describes its position in the Wayback Machine's daily scrape sequence (No. 1, No. 2, No. 3, ...). The third item shows to which section on the Google News page the observation belongs. The fourth item indicates the exact position under that section, in this case the third story within the first story block. The fifth is the news story's title. Sixth is the name of the news source. Seventh, I
have the full link to the original story, and eighth is the domain name extracted from the full link. With more than a million observations in this format, I can count the number of times each domain appeared during each period of time. The resulting domain link frequency variable is one of the main variables I use in my analysis.

Table 3.2: Sample line of Google News data
  Timestamp:       08/16/09 18:33:21
  Scrape sequence: No. 2 Wayback scrape of the day
  Section:         Health
  Position:        1_3 (third story within the first story block)
  Title:           Connecticut To Make Swine Flu Vaccine Available
  Source:          Hartford Courant
  Full link:       http://web.archive.org/web/20090816183321/http://www.courant.com/health/hcconnecticut-swine-flu-vaccine-0816,0,7661066.story
  Domain:          www.courant.com

3.2 Summary Statistics

Table 3.3 shows the number of unique domains per period. Although each period features the same number of scrapes, the redesign of Google News pages resulted in an increase in observations in the later periods. In this sample, there are a total of 5220 unique domains. Breaking the sample into periods, the number of unique domains in each period adds up to 9355. This suggests that there are 4135 occurrences of a domain appearing multiple times across different periods.

Table 3.3: Number of unique domains by period
  Total     5220
  Period 1  1111
  Period 2   919
  Period 3   841
  Period 4   943
  Period 5  1745
  Period 6  1967
  Period 7  1829
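
As a sketch of this counting step, assuming the parsed observations sit in a pandas DataFrame with "period" and "domain" columns (names of my choosing), the link frequency variable and the tied frequency rank used later in this chapter can be built with a group-by:

# A minimal sketch of building the domain link frequency variable.
# The input frame and its column names are hypothetical.
import pandas as pd

obs = pd.DataFrame({
    "period": ["Period 1", "Period 1", "Period 1", "Period 2", "Period 2"],
    "domain": ["www.nytimes.com", "www.nytimes.com", "www.courant.com",
               "www.courant.com", "www.bbc.co.uk"],
})

# Link frequency: how many times each domain appears in each period.
link_freq = (obs.groupby(["period", "domain"])
                .size()
                .rename("links")
                .reset_index())

# Domains with the same link frequency share the same rank (dense ranking).
link_freq["rank"] = (link_freq.groupby("period")["links"]
                              .rank(method="dense", ascending=False)
                              .astype(int))
print(link_freq)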

Table 3.4 presents the summary statistics of the number of links per domain by period. While I have a total of 5220 unique domains, for each period I only count the domains that appear in that period. Therefore, the minimum number of links per domain by period is 1. For the earlier periods, 25%-50% of domains feature only 1 link, and the percentage is even larger for the later periods, as indicated by the 1st quartile, median, and 3rd quartile statistics. Notice that the 3rd quartile statistic is very low, at 4 or 5, for all time periods; this suggests that the majority of domains have only a small number of links on Google News during each 12-month period. The numbers of links for the top domains are very large, relatively speaking, lifting the mean to a level well above not only the median but also the 3rd quartile.

Table 3.4: Summary of the link frequency by period

  Period    Min   1st quartile  Median  Mean   3rd quartile  Max
  Total     1.00  1.00          2.00    19.68  5.00          4750.00
  Period 1  1.00  1.00          2.00     8.89  5.00           426.00
  Period 2  1.00  1.00          2.00     9.86  5.00           375.00
  Period 3  1.00  1.00          1.00    12.10  4.00           804.00
  Period 4  1.00  1.00          1.00    12.01  4.00           955.00
  Period 5  1.00  1.00          1.00    11.15  4.00          1020.00
  Period 6  1.00  1.00          1.00    11.47  4.00           896.00
  Period 7  1.00  1.00          1.00    11.09  4.00           923.00

Table 3.5, which lists the number of domains that exceed a specific link frequency level, offers a more detailed picture of the distribution of link frequency. Over the entire sample, the first column indicates a domain's link frequency at a certain percentile (for instance, the domain at the 60th percentile
has 2 links). Cells in each column contain the number of domains exceeding the corresponding link frequency level in each period. The absolute numbers of links shift with the number of unique domains for each period. From the percentage table I can see that, in later periods, there are relatively more domains with only one link and also relatively more domains that appear frequently (more than six times in each period).

Table 3.5: The link frequency in link frequency quantiles by period (all domains, N = 9,355)

  Level of links  Period 1  Period 2  Period 3  Period 4  Period 5  Period 6  Period 7
  In absolute value
  > 0             1111      919       841       943       1745      1967      1829
  > 1              639      507       402       454        842       949       870
  Q0.60: > 2       446      373       260       307        572       671       615
  Q0.70: > 3       347      296       214       249        445       515       483
  Q0.80: > 6       150      126       107       118        211       243       260
  Q0.90: > 15      106       90        88        96        158       182       198
  Q0.95: > 39       47       53        45        46         80        91        96
  Q0.99: > 195       7        8         9         9         21        23        16
  In percentage
  > 0             1         1         1         1         1         1         1
  > 1             .575      .552      .478      .481      .483      .482      .475
  Q0.60: > 2      .401      .406      .309      .326      .328      .341      .336
  Q0.70: > 3      .312      .322      .254      .264      .256      .262      .264
  Q0.80: > 6      .135      .137      .127      .125      .121      .124      .142
  Q0.90: > 15     .095      .098      .105      .102      .091      .093      .108
  Q0.95: > 39     .042      .058      .054      .049      .046      .046      .054
  Q0.99: > 195    .006      .009      .011      .010      .012      .011      .009

Table 3.6, which shows the number of domains responsible for a certain percentage of the total number of links in each period, reinforces the results from the previous tables. The majority of domains have only one link in each period;
however, the links from those domains constitute less than 10% of total links in any period. About 70% of all links in each period are from the top 5% of domains. From those statistics, I can see that there is only a lengthened tail, in the sense that there are many small and less popular news websites (the domains with only one or two links). However, the tail is not fattened; those news sites are not receiving enough exposure in total, compared with the hit news websites, to be truly considered part of a long tail (lengthened and fattened).

Table 3.6: Number of domains responsible for a certain percentage of the total number of links (all domains, N = 9,355)

  Share of links     Period 1  Period 2  Period 3  Period 4  Period 5  Period 6  Period 7
  Number of links    9872      9060      10176     11326     19462     22555     20278
  Number of domains  1111       919        841       943      1745      1967      1829
  In absolute value
  >= .90              402       297        176       206       414       461       442
  >= .80              190       138         71        82       157       176       192
  >= .70              102        71         38        42        80        87       104
  >= .60               59        44         22        24        45        48        58
  >= .50               34        28         13        14        26        26        30
  In percentage
  >= .90              .362      .323       .209      .218      .237      .234      .242
  >= .80              .171      .150       .084      .087      .090      .089      .105
  >= .70              .092      .077       .045      .045      .046      .044      .057
  >= .60              .053      .048       .026      .025      .026      .024      .032
  >= .50              .031      .030       .015      .015      .015      .013      .016

After counting the number of links for each domain, I rank the domains by link frequency. Domains with the same link frequencies are
assigned the same frequency rank. Table 3.7 shows summary statistics of the link frequency rank. Once again, I can see that a majority of domains are tied at the bottom ranks.

Table 3.7: Summary of the link frequency rank by period

  Period    Min   1st quartile  Median  Mean    3rd quartile  Max
  Total     1.00  203.00        206.00  198.90  207.00        207.00
  Period 1  1.00   72.00         75.00   70.70   76.00         76.00
  Period 2  1.00   70.00         73.00   68.59   74.00         74.00
  Period 3  1.00   69.00         72.00   66.96   72.00         72.00
  Period 4  1.00   72.00         75.00   69.88   75.00         75.00
  Period 5  1.00   98.00        101.00   95.57  101.00        101.00
  Period 6  1.00  106.00        109.00  103.50  109.00        109.00
  Period 7  1.00   96.00         99.00   93.33   99.00         99.00

Figure 3.2 shows the relationship between the log of link frequency and the log of link frequency rank for each domain in each period. Similar to the findings of Brynjolfsson, Hu and Smith (2010), the slope between the log of link frequency and the log of link frequency rank is not constant. The curve drops at a faster rate than a log-linear relationship indicates, and the slope becomes steeper as the rank increases, suggesting a cluster of a few dominant domains and a relatively short tail. The fitted lines will be discussed later.
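
A minimal sketch of how such a rank-frequency scatterplot can be drawn, using synthetic heavy-tailed counts in place of the actual per-period Google News data:

# Illustrative log-log rank-frequency plot on synthetic Zipf-like counts;
# the real plot uses the per-period domain link frequencies.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
links = np.sort(rng.zipf(a=2.0, size=800))[::-1]           # descending counts
ranks = pd.Series(links).rank(method="dense", ascending=False)

plt.scatter(np.log(ranks), np.log(links), s=8)
plt.xlabel("log of link frequency rank")
plt.ylabel("log of link frequency")
plt.savefig("rank_frequency.png")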

[Figure 3.2: Scatterplot of the log of the link frequency vs. the log of the link frequency rank][13]

[13] The log of the link frequency is on the y-axis; the log of the link frequency rank is on the x-axis.

The following tables are concerned with the domain characteristics. A top-level domain is one of the domains at the highest level in the hierarchical Domain Name System of the Internet. For example, in the domain name www.example.com, the top-level domain is com. Com is the top-level domain for commerce; uk is the top-level domain for the United Kingdom; net is the top-level domain originally for network providers; and org is the top-level domain for non-profit organizations. The top-level domains listed in Table 3.8 appear frequently in my data sample; the vast majority are com domains.
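
As a sketch, the top-level domain characteristic can be read directly off the domain string. This naive version simply takes the last dot-separated label, which is all that is needed to distinguish com, uk, net, and org here (country domains such as news.bbc.co.uk still end in uk):

# A minimal top-level domain extractor; the example domains are hypothetical.
def top_level_domain(domain: str) -> str:
    """Return the last label of a domain name, e.g. 'www.example.com' -> 'com'."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()

for d in ["www.example.com", "news.bbc.co.uk", "www.npr.org"]:
    print(d, "->", top_level_domain(d))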

The other top-level domains only appear a few times in each period, so they will not affect my analysis. I will test the effect of having those top-level domains as one of the domain characteristics.

Table 3.8: Difference in top-level domain by period

  Period (unique domains)  com   uk   net  org
  Period 1 (1111)           853  59   42   28
  Period 2 (919)            723  43   30   25
  Period 3 (841)            743  22   14   14
  Period 4 (943)            847  24   17   22
  Period 5 (1745)          1507  29   32   48
  Period 6 (1967)          1615  35   40   80
  Period 7 (1829)          1525  47   24   70

As mentioned earlier, a domain's frequent appearance in one of the sections is an indication of its intrinsic characteristics. A story usually falls into one of these sections: top stories, small section, business, entertainment, health, sci/tech, sports, U.S., or World.[14] The nature of most of these categories is self-explanatory; "small section" describes stories on Google's sidebar, on the right side of each Google News page.[15]

[14] Because of differences in page design, sometimes stories appear under a section that does not belong to my list; however, those cases are rare.
[15] In the HTML code of Google News pages, Google named the sidebar the small section.

If a domain only appears once or twice in an entire period, there are simply too few observations to merit a frequent appearance in any section. Therefore, I created a subset of my sample for domains with 5 or more links in each period. Then, for each of these domains, I calculate the frequency with which it appears in each section. Table 3.9 presents the summary statistics of the
frequency of each domain's appearance in each section. As each section has many domains with zero links, I exclude those domains from each section. In the later analysis, I introduce a dummy defining a "top stories" domain as one whose appearance frequency in that section is greater than the mean, and similarly for the other sections.

Table 3.9: Summary of domains' section characteristics, in percentage (non-zero observations only)

  Section        Min   1st quartile  Median  Mean   3rd quartile  Max
  Top Stories    1.00  11.00         18.00   21.25  27.00         100.00
  Small Section  1.00   7.25         14.00   24.44  29.00         100.00
  Business       1.00   3.00          8.00   14.47  17.00         100.00
  Entertainment  1.00   6.00         14.00   26.58  38.25         100.00
  Health         1.00   5.00         11.00   16.67  20.00         100.00
  Sci/Tech       1.00   5.00         14.00   28.66  50.00         100.00
  Sports         1.00   7.25         17.50   27.71  40.00         100.00
  U.S.           1.00   6.00         14.00   15.79  20.00         100.00
  World          1.00   7.00         17.00   24.71  33.50         100.00

3.3 Empirical Models and Results

First, adopting the method of Brynjolfsson, Hu, and Smith (2010), I estimate the log-linear relationship between domain link frequency and domain link frequency rank on Google News. The linear model I use is:

ln(link frequency) = β_0 + β_1 ln(link rank)

I run this regression for each period. Domains with the same link frequencies are assigned the same frequency rank. The results are reported in Table 3.10.

[Table 3.10: Regression results, simple log-linear regression]

I see that the slope on the log of link frequency rank is getting steeper over time,[16] so that as a domain's link frequency rank increases, its link frequency drops faster in the later periods. However, the relationship between link frequency and link rank appears log-concave, not log-linear, as discussed earlier. I follow by running an OLS regression with a quadratic term. The regression model I use is:

ln(link frequency) = β_0 + β_1 ln(link rank) + β_2 [ln(link rank)]^2

[16] Scatterplots featuring this relationship with fitted lines were previously shown as Figure 3.2.
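
A minimal sketch of both estimations with statsmodels, run on synthetic heavy-tailed counts in place of one period's actual Google News data (the variable names lfreq and lrank follow the ones used in the results):

# Fit ln(link frequency) on ln(link rank); in the real analysis this is run
# once per period, while a single synthetic sample stands in here.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"links": np.sort(rng.zipf(a=2.0, size=500))[::-1]})
df["rank"] = df["links"].rank(method="dense", ascending=False)
df["lfreq"] = np.log(df["links"])
df["lrank"] = np.log(df["rank"])

# Simple log-linear model: ln(link frequency) = b0 + b1 ln(link rank)
simple = smf.ols("lfreq ~ lrank", data=df).fit()

# Quadratic model: ln(link frequency) = b0 + b1 ln(link rank) + b2 [ln(link rank)]^2
quad = smf.ols("lfreq ~ lrank + I(lrank ** 2)", data=df).fit()

print(simple.params, simple.rsquared)
print(quad.params, quad.rsquared)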

Again, I run this regression for each period. The results are reported in Table 3.11.

[Table 3.11: Regression results, OLS regression with quadratic terms]

This set of regressions offers a more detailed picture. The fit is better than in the previous simple regression, as R² increases for each period. From the linear term, I observe that the head of the distribution is getting fatter; in the later periods, the top-ranked domains get more links, as indicated by the higher coefficients on lrank. As I move to the right of the distribution, the number of
links received by those less-favorably ranked websites drops more quickly in the later periods. This confirms the findings from the summary statistics. I observe the tail lengthening over time, in the sense that an increasing number of small or niche websites get linked from Google News over time; however, each of these small or niche websites is featured only a few times, while the large and top-ranked news websites get even greater exposure. In other words, the tail is lengthened but not fattened. Therefore, my results fail to support the long tail hypothesis, as the small and lesser-known news sites are receiving disproportionately less exposure from Google News than the top players.

In order to further explore the difference in link frequency for domains with different link frequency ranks and to test the effect of domain characteristics, I follow a quantile regression model used in a similar situation by Elberse and Oberholzer-Gee (2007). In a quantile regression model, a specified conditional quantile of the outcome variable is expressed as a linear function of observed covariates. By examining multiple quantiles, I can observe how the distribution changes with covariates, allowing richer inferences. Quantile regression cannot be achieved by simply segmenting the response variable into subsets according to its unconditional distribution and then doing least squares fitting on these subsets. It is not a form of truncation on the dependent variable; instead, quantile regression can be achieved through optimization (Koenker & Hallock, 2001).
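
As a concrete illustration of this setup, here is a minimal sketch of a quantile regression of log link frequency on period dummies, using statsmodels' QuantReg on synthetic data with column names of my choosing; the actual analysis also includes the domain characteristic dummies described above:

# Quantile regression of log link frequency on period dummies at several
# quantiles; synthetic data stands in for the actual per-domain sample.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lfreq": np.log(rng.zipf(a=2.0, size=600).astype(float)),
    "period": rng.choice(["Period 1", "Period 5", "Period 7"], size=600),
})

# C(period) expands the labels into dummies, omitting the first (base) period.
for theta in (0.50, 0.75, 0.90, 0.95):
    fit = smf.quantreg("lfreq ~ C(period)", data=df).fit(q=theta)
    print(theta, fit.params.round(3).to_dict())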

Koenker and Hallock explain that, just as one can define the sample mean as the solution to the problem of minimizing a sum of squared residuals, one can define the median as the solution to the problem of minimizing a sum of absolute residuals. The symmetry of the piecewise linear absolute value function implies that the minimization of the sum of absolute residuals must equate the number of positive and negative residuals, thus assuring that there are the same number of observations above and below the median. Since the symmetry of the absolute value yields the median, minimizing a sum of asymmetrically weighted absolute residuals (simply giving differing weights to positive and negative residuals) would yield the quantiles.[17]

[17] More detailed calculations can be found in Koenker & Hallock (2001).

I estimate models of the following general form:

Q_θ(y | x) = x′β(θ)

where Q_θ(y | x) denotes the θth quantile of the distribution of y, the log of link frequency in each period for each domain, given a vector x of covariates. To identify the emergence of a long tail in this setting, the covariates include a set of time dummies for each period. To control for the domain characteristics, the covariates also include a set of domain characteristic dummies.

Table 3.12 shows the results of a series of quantile regressions of the log of link frequency against only the time dummies. All models omit a dummy for Period 1 since it is the base period. The intercept term is the mean of the link frequency in each quantile, and the coefficients for each period dummy