WEB SCRAPING LEGAL ISSUES AND BEST PRACTICES FOR PUBLISHERS CEO Topics December 9, 2016
Table of Contents
Why This Topic
Methodology
Definition and Use Cases
Sample Use Cases
Legal Considerations
Selected Legal Cases
Essential Actions
Why This Topic

Web scraping is an important tool for information businesses, both for building products and for internal uses such as competitive intelligence. The volume of information available on the open web and the ever-increasing sophistication of web scraping technologies allow organizations of all sizes to gather large quantities of rich data for analytics purposes. The sources of scraped material are myriad, ranging from company websites to social media to government data sets, as are the reasons for conducting web scraping.

Web scraping activity often sets the interests of a website owner against those who want to harvest the data. Owners usually want to control, profit from, or leverage the data on their website. Those who collect data may want to use it for building new data-driven information solutions of their own, or for internal uses such as competitive intelligence and analysis, reasons often completely unconnected to the original purpose of publication on the site.

As the use of big data analytics to find patterns and trends hidden in unstructured or seemingly unrelated data sets continues to grow, the degree and volume of web scraping also continue to increase. Where many of the first court cases involving web scraping centered on simple copying, republishing, or linking to scraped data for competitive purposes, the use of data analytics tools has broadened the playing field of possible uses of scraped material, and it is likely to make interpretations of liability more nuanced and driven by the circumstances of each specific case.

Outsell examined the use of web scraping and profiled leading suppliers of such services in our 2015 report Evaluating Automated Content Tools. In this report, we look at the definition of web scraping and some use cases in current practice, followed by a review of the legal considerations and theories of liability by which website owners have tried to sue or prevent web scraping activity.
We also present some sample cases illustrating those theories. Finally, we look at some best practices for companies that use or are considering the use of web scraping as part of their business intelligence or product development practices. It's essential for leaders to consider this evolving mechanism for driving data-driven solutions and be clear on the current state of potential liabilities and best practices.

Methodology

Primary research for this report comprised a series of 10 interviews with information managers, business intelligence researchers, and legal practitioners. We supplemented those interviews with secondary research from published reports, industry blogs, and the mainstream press. Outsell's ongoing dialogue with the market added depth to the primary and secondary research, as did our regular coverage of state-of-the-art vendors and emerging companies that are using this technique to develop new commercial offerings.
Definition and Use Cases

If web content can be viewed on a page, it can be scraped. Web scraping, sometimes called web harvesting or web data extraction, is a software technique used to copy or collect selected information from websites. Automatic crawling of websites is widespread, with crawlers (also called "robots" or "spiders") extracting varying amounts of data and information. Some companies use web scraping to harvest publicly available data (such as financial data, weather, news, company information, or sports scores) to enhance existing offerings or build completely new products. Others use it to watch trends, gather competitive intelligence and product information, monitor changes in websites, or gain strategic insight. Web search engines use a similar technique to index websites, capturing links and short snippets of information, in order to catalogue the data and provide accurate search results.

[Figure: websites with HTML pages are transformed by web scraping into structured data]

Sample Use Cases

Real Estate: Scraping housing sale or rental data from online real estate listings in order to track pricing trends over time, average prices, or uncover indicators useful to real estate agents, builders, investors, etc.

Legislative or Regulatory: Scraping government or regulatory agency websites to discover, understand, or monitor changes or developments in legislation or regulations.

Health Care: Scraping healthcare forums and blogs to gather conversations or comments about particular drugs, medical devices, or treatments in order to monitor brand reputation, adverse effects, or other clinical information.

Law Enforcement: Identifying patterns that ensure compliance or help ferret out illegal activity.

Lead Generation or Company Tracking: Maintaining a near real-time understanding of companies' activities, officers, contact information, and other critical details about products and services, aggregated vertically or for broad multi-industry use.
Public Libraries: Archiving and preserving local or commercial history.

University Libraries: Archiving course syllabi and campus activities.

Internet Archives: Compiling free books, movies, music, software, etc., or simply archiving websites to compile a history of the web.

Numerous illicit applications, including those used for hacking, also employ web scraping. Corporate espionage, theft of intellectual property, wholesale copying of websites, and degrading website performance are all examples of scraping with an intent to cause damage or loss.
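The extraction step common to the use cases above (copying selected fields out of a page's HTML) can be sketched with Python's standard library. The markup, the tag names, and the "price" class below are purely illustrative, not taken from any real site:

```python
from html.parser import HTMLParser

# Minimal sketch of the extraction step: pull listing prices out of an
# HTML page. The <span class="price"> convention here is hypothetical;
# a real target site will use its own markup.
class PriceScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when we enter a price element.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

html = """
<ul>
  <li>2-bed flat <span class="price">$1,950</span></li>
  <li>Studio <span class="price">$1,200</span></li>
</ul>
"""
scraper = PriceScraper()
scraper.feed(html)
print(scraper.prices)  # -> ['$1,950', '$1,200']
```

A production scraper would fetch the page over HTTP first and often uses a dedicated parsing library, but the pattern is the same: walk the markup and keep only the fields of interest.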
Legal Considerations

The legal landscape around web scraping is still evolving. Under certain circumstances, outlined below, US courts have found that the activity is legal and something that website operators should expect. The activity touches on many of the most sensitive legal and political areas of the digital era:

The ability to protect and enforce copyright when technology makes copying and sharing so easy.

The power of Google, Facebook, and others as they become not just pointers linking to content but the home of that content.

The ownership of user-generated content shared on social platforms.

The protection of personal data.

By some estimates there have been fewer than 100 cases involving web crawling in the US. One of the reasons is that plaintiffs have been unlikely to win such cases, and it's hard to obtain compensation even in the event of a favorable judgment. Website owners are therefore reluctant to incur the costs of legal action even when a web crawler is clearly violating the terms of access to their site. There are several theories of infringement or liability by which website owners have brought legal action against web scrapers, but the case law is still very much in flux. The most prominent cases involved one or more of the following:

Breach of Terms and Conditions

Many websites post terms and conditions (T&Cs) prominently on their pages, addressing the issue of access to their website via scrapers. This is intended to create breach-of-contract liability by establishing a contract between the website owner and the scraper. But posting T&Cs (also known as a "browse-wrap") may not be enough to show that a scraper has breached the terms of the website, since there is no active acceptance on the part of the scraper. What appears to be more enforceable is the use of a "click-wrap," in which the web scraper has to actively click to accept the T&Cs, creating proof of acceptance.
Copyright or Trademark Infringement

In the US, the legal doctrine of fair use allows limited use of copyrighted material under certain conditions without the explicit permission of the copyright holder. Uses for purposes such as parody, criticism, commentary, or academic research are regarded as fair use. Legal precedent and the 1976 Copyright Act set out four specific factors to determine whether use of copyrighted material, by scraping or other means, is fair use, though courts may also take other considerations into account. The four main factors are:

The purpose of the use, including whether it is for personal or commercial use, and the degree to which it is transformative in some way.

The nature of the copyrighted work.

The volume of the material scraped in relation to the copyrighted work as a whole.

The effect of the use on the potential market for or value of the copyrighted work.

Computer Fraud and Abuse Act (CFAA)

There are several federal and state laws against hacking or accessing another's computer. The CFAA states that whoever intentionally accesses a computer without authorization and as a result of such conduct recklessly causes damage is in violation, especially if the violated website can prove loss or damages.

Trespass to Chattels

This is a term for a civil wrong in which one entity interferes with another's personal property, causing loss of value or damage.

Robots Exclusion Protocol

This is an industry-standard convention that allows a website to place a robots.txt file on its site that communicates instructions to web crawlers, indicating which crawlers can access the site and which pages they can access. While a common protocol, it has limited legal value, serving mainly as supporting evidence in cases involving breach of terms and conditions.
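As an illustration of how the Robots Exclusion Protocol works in practice, Python's standard library includes a parser for robots.txt rules. The rules and the "pricebot" agent name below are invented for the example:

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: a hypothetical "pricebot" crawler is
# barred from the whole site, and all other crawlers from /private/.
robots_txt = """\
User-agent: pricebot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/listings"))         # True
print(rp.can_fetch("*", "https://example.com/private/report"))   # False
print(rp.can_fetch("pricebot", "https://example.com/listings"))  # False
```

Nothing in the protocol technically forces a crawler to obey these rules; compliance is voluntary, which is part of why the file's main legal role is as evidence of the site owner's stated intent.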
The situation in the European Union is governed by different legislation and legal systems, but many of the same principles apply, for example relating to terms and conditions. Website and database owners there have tended to rely on copyright infringement claims against screen scrapers (there are multiple EU directives on copyright), but there has been little case law to provide guidance. Some key provisions are:

The EU Database Directive of 1996 provides legal protection for the creators of databases that are not covered by intellectual property rights, so that elements of a database that are not the author's own original creation are protected. In particular, it provides protection where there has been "qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents."

A 2015 ruling by the European Court of Justice in a case concerning the airline Ryanair greatly strengthened the ability of website operators to protect their content through contractual terms and conditions (see below) when they are not covered by the Database Directive.

In addition to intellectual property rights infringement, website owners have, in theory, other legal arguments against web scraping. As in the US, a website owner in the UK could try to bring a common law tort claim for trespass to chattels. A website owner in the UK could also seek to rely on the Computer Misuse Act 1990, which prohibits unauthorized access to, or modification of, computer material. However, unlike in the US, neither of these arguments has been tested in the UK courts specifically in relation to website scraping.

Selected Legal Cases

While theories of liability are still developing, there is some evidence that US courts are leaning towards protecting proprietary content, especially when there is some indication of damages suffered by site owners.
On the other hand, vast amounts of web scraping and content harvesting take place every day and never face challenges. The use of big data and open data is not unregulated, but it breeds a culture of find-and-reuse that can become pervasive. Circumstances vary in each case and depend on how the scraper accessed content, what the scraper harvested, and the terms of use presented by the website owner. The following are a sample of issues and resolutions illustrating the range of liabilities.

Associated Press vs. Meltwater

This high-profile case was decided for the Associated Press (AP). The media-monitoring firm Meltwater had been crawling AP websites and had extracted and republished significant amounts of text from AP news articles without adding any commentary or insight. Meltwater's stance was that it was operating under the fair use provisions of copyright law. Although the court found in favor of AP, the firms settled the case and entered into a partnership arrangement to develop new products.

Ticketmaster vs. Riedel Marketing Group

Riedel Marketing Group (RMG) used web robots to saturate the Ticketmaster site so that it could harvest large quantities of desirable tickets for resale. Ticketmaster argued that RMG had agreed to the terms and conditions of the site but ignored them. Ticketmaster obtained a preliminary injunction against RMG, and the court held that RMG had infringed on Ticketmaster's copyrighted material.

eBay vs. Bidder's Edge

Bidder's Edge used a web scraper to obtain details of active eBay auctions and reposted them on its own site. In addition, it crawled eBay's site more than 100,000 times per day, which hindered eBay's site performance. Bidder's Edge ignored the robots.txt files as well as a cease and desist letter from eBay, and eBay brought a trespass to chattels claim against Bidder's Edge.
eBay successfully obtained a preliminary injunction against Bidder's Edge, since the court decided that the Bidder's Edge spiders interfered with the capacity of eBay's servers.

QVC vs. Resultly

QVC's terms and conditions did not prohibit web scraping, but Resultly was so aggressive with its scraping that QVC's servers became overburdened, and customers were unable to complete transactions. QVC alleged a loss of $2 million, and sued Resultly under the Computer Fraud and Abuse Act (CFAA), seeking a preliminary injunction. The court ruled that the plaintiff had to prove that the defendant both knowingly and intentionally caused damage to the plaintiff's computer. In this case the court found that Resultly did not intend to cause damage, so QVC was unable to show that Resultly violated the CFAA.

Craigslist vs. Naturemarket

Due to the extensive and ever-changing listings on Craigslist, the site has been the target of numerous scrapers over the years, and it has successfully sued or stopped numerous organizations from scraping its data. In one case, Naturemarket sold software that made it easy for customers to automatically post listings to Craigslist, and it also posted listings on behalf of customers. Naturemarket also scraped email addresses from Craigslist's site. Craigslist sued, citing copyright infringement, the Computer Fraud and Abuse Act, and breach of terms of use. Naturemarket failed to contest the suit, and the court awarded Craigslist a judgment of $1.3 million.

Google vs. Authors Guild

A related case, though not directly concerning web scraping. The US Court of Appeals for the Second Circuit held that Google's scanning of millions of copyrighted books is fair use because of the transformative nature of the use. The court also confirmed that facts are not protected by copyright, suggesting that harvesting factual data from a website is not of itself an infringement.

Ryanair Ltd vs. PR Aviation BV

The most important EU case. In its 2015 judgment, the European Court of Justice ruled that owners of publicly available databases which do not fall under the protection of the Database Directive are free to restrict the use of the data through contractual terms on their website.
The case was brought by the budget airline Ryanair against PR Aviation, a travel website operator that extracted data from Ryanair's website in order to compare prices and book flights on payment of a commission. Ryanair required anyone accessing flight data on its website to tick a box to accept its terms and conditions, which included a prohibition on the automated extraction of data from the website for commercial purposes without the airline's permission. The court found that the Database Directive did not protect Ryanair, but ironically the airline has gained greater protection as a result, because it is free to create contractual limits on the use of its database. Businesses that carry out screen scraping activities in the EU are consequently at an increased risk of being sued for breach of contract by database owners. All a website or database owner needs to do to lawfully protect its data is to include terms and conditions on its website that prohibit screen scraping and require users to accept the terms before using the site where the relevant data is located.
Essential Actions and Best Practices

Web scraping can be a useful and constructive tool for gathering unstructured information to use for insights on multiple levels. While the legal landscape remains unsettled, and uncertainties abound in terms of what is permissible with web scraping, in Outsell's opinion there are legal, ethical, and social boundaries that are important to maintain. Companies that conduct scraping activities must carefully consider the consequences of their actions. As in many grey areas of business, it's as much a question of managing the risk as of knowing what may or may not turn out to be legal. The key question often becomes "How likely are we to be sued, and how big is the risk?" The following best practices can limit risk or liability when conducting web scraping:

Comply with terms and conditions and robots.txt: If there is an advisory, in the form of a robots.txt file or specified terms and conditions, that permits, limits, or prohibits web scraping, it's best to stay in compliance and not breach the implied contract.

Don't overburden the website: Make sure that excessive queries don't interfere with a website's normal processes, slow a site's performance, or cause unintended harm to the website operator.

Copy as little as possible: Avoid making complete copies of web pages or design elements from the site. Focus on factual information such as pricing data, dates, locations, etc. When it comes to non-factual information, store it only as long as required for analysis. One company that we interviewed has a policy of keeping non-factual, competitive information for only 30 days, during which it analyzes the data, and then discards it.

Don't plagiarize: Don't take information from a site and redistribute it or use it verbatim. Transform it, analyze it, or use it in different ways than the original site.
Avoid crawling or storing sensitive data: Certain types of non-public information or personally identifiable information, such as names, birth dates, or email addresses, are off-limits.

Don't compete: Avoid using scraped data to offer a competing product or service, which can potentially cause damage to the scraped website.

Be transparent: Scraping technology makes it possible to put a company's contact details in the scraper's header to identify who is doing the scraping. Using technical trickery to mask IP addresses, bypass CAPTCHAs, or otherwise conceal or filter activities is off-limits. Don't misrepresent one's identity or purpose, or use deception in order to gain access.

Use an API if it's there: Many websites offer the ability to download data via an API, either free or for a fee. If that is the case, it's preferable to obtain the data that way rather than by scraping.

Obey cease and desist requests: If a website operator wants the scraping to stop, acknowledge the request and stop. In the long run, it's better to adhere to the request than to risk liability.

Keep up to date: The legal landscape is still in its formative stages, and there are no clear-cut patterns in the decisions. In addition, intellectual property laws are not uniform across jurisdictions or countries. It is important to keep abreast of the developing case law on web scraping to mitigate risk.

Develop a written company policy: Web scraping is a normal and expected business activity. It's not possible to eliminate risk altogether, but establishing and adhering to clear guidelines for web scraping activities is critical from both a legal and an ethical standpoint.
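Several of the practices above (identify yourself in the request headers, honor robots.txt, and throttle request rates) can be combined in code. The sketch below shows one possible shape using Python's standard library; the bot name, contact address, crawl delay, and class are illustrative assumptions, not a standard:

```python
import time
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical polite-crawler settings: identify the operator in the
# User-Agent header and pause between requests so the target site is
# not overburdened. Values here are examples only.
USER_AGENT = "AcmeResearchBot/1.0 (contact: data-team@example.com)"
CRAWL_DELAY = 2.0  # seconds between requests

class PoliteFetcher:
    def __init__(self, robots_txt):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.last_request = None

    def build_request(self, url):
        # Refuse URLs that robots.txt disallows for our agent.
        if not self.rules.can_fetch(USER_AGENT, url):
            raise PermissionError(f"robots.txt disallows {url}")
        # Throttle: wait until CRAWL_DELAY has passed since last request.
        if self.last_request is not None:
            wait = self.last_request + CRAWL_DELAY - time.monotonic()
            if wait > 0:
                time.sleep(wait)
        self.last_request = time.monotonic()
        return Request(url, headers={"User-Agent": USER_AGENT})

fetcher = PoliteFetcher("User-agent: *\nDisallow: /checkout/\n")
req = fetcher.build_request("https://example.com/listings")
print(req.get_header("User-agent"))  # the transparent identity header
```

The robots.txt check happens before any request is built, and the delay bounds how hard the target site is hit; in practice both would be tuned to the site's stated preferences.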
Related Research

Reports
Text and Data Mining: Technologies Under Construction (January 1, 2016)
Data Business Fundamentals (November 24, 2015)
Evaluating Automated Content Tools (August 13, 2015)

Insights
Researcher Open Data Practices More Prevalent Than Previously Thought (November 21, 2016)
European Commission Proposes Controversial Changes to EU Copyright Laws (October 26, 2016)
Fair Use in Copyright Law Open to Exploitation (October 25, 2015)
ABOUT THE AUTHOR

Simon Alterman
VP & Lead Analyst, Manager of Outsell's Leadership Community
+44 (0)20 7419 2352
salterman@outsellinc.com

See additional reports published in all coverage areas. Does this report meet your needs? Send us your feedback.

ABOUT OUTSELL

The rapid convergence of information, media, technology, and data is reshaping businesses every day. Enter Outsell, Inc., the only research and advisory firm focusing on these four sectors. As the trusted advisor to executives, our analysts turn complexity into clarity, and provide the facts and insights necessary to make the right decisions. Our proven blend of big data, research, proprietary intelligence, and exclusive leadership communities produces tangible results and a strong ROI. We promise to deliver wow and ensure our clients stay more focused, save time, and grow revenue in a fast-changing digital world.

www.outsellinc.com
contact_us@outsellinc.com
Burlingame, CA USA +1 650-342-6060
London, United Kingdom +44 (0)20 8090 6590

Outsell, Inc. is the sole and exclusive owner of all copyrights and content in this report. As a user of this report, you acknowledge that you are a licensee of Outsell's copyrights and that Outsell, Inc. retains title to all Outsell copyrights in the report. You may use this report only within your own work group in your company. For broader distribution rights options, please email us at info@outsellinc.com. The information, analysis, and opinions (the "Content") contained herein are based on the qualitative and quantitative research methods of Outsell, Inc. and its staff's extensive professional expertise in the industry. Outsell has used its best efforts and judgment in the compilation and presentation of the Content and to ensure to the best of its ability that the Content is accurate as of the date published. However, the industry information covered by this report is subject to rapid change.
Outsell makes no representations or warranties, express or implied, concerning or relating to the accuracy of the Content in this report, and Outsell assumes no liability related to claims concerning the Content of this report.

Advancing the Business of Information
© 2016 Outsell, Inc.