WEB SCRAPING LEGAL ISSUES AND BEST PRACTICES FOR PUBLISHERS

CEO Topics, December 9, 2016

Table of Contents
Why This Topic
Methodology
Definition and Use Cases
Sample Use Cases
Legal Considerations
Selected Legal Cases
Essential Actions

Why This Topic

Web scraping is an important tool for information businesses, both for building products and for internal uses such as competitive intelligence. The volume of information available on the open web and the ever-increasing sophistication of web scraping technologies allow organizations of all sizes to gather large quantities of rich data for analytics purposes. The sources of scraped material are myriad, ranging from company websites to social media to government data sets, as are the reasons for conducting web scraping.

Web scraping activity often sets the interests of a website owner against those of the parties who want to harvest the data. Owners usually want to control, profit from, or leverage the data they have on their website. Those who collect data may want to use it for building new data-driven information solutions of their own, or for internal uses such as competitive intelligence and analysis, reasons often completely unconnected to the original purpose of publication on the site.

As the use of big data analytics to find patterns and trends hidden in unstructured or seemingly unrelated data sets continues to grow, the degree and volume of web scraping also continue to increase. Where many of the first court cases involving web scraping centered on simple copying, republishing, or linking to scraped data for competitive purposes, the use of data analytics tools has broadened the range of possible uses of scraped material, and it is likely to make interpretations of liability more nuanced and driven by the circumstances of each specific case.

Outsell examined the use of web scraping and profiled leading suppliers of such services in our 2015 report Evaluating Automated Content Tools. In this report, we look at the definition of web scraping and some use cases in current practice, followed by a review of the legal considerations and theories of liability under which website owners have sued over or tried to prevent web scraping activity. We also present some sample cases illustrating those theories. Finally, we look at some best practices for companies that use or are considering the use of web scraping as part of their business intelligence or product development practices.

It's essential for leaders to consider this evolving mechanism for building data-driven solutions and to be clear on the current state of potential liabilities and best practices.

Methodology

Primary research for this report comprised a series of 10 interviews with information managers, business intelligence researchers, and legal practitioners. We supplemented those interviews with secondary research from published reports, industry blogs, and the mainstream press. Outsell's ongoing dialogue with the market added depth to the primary and secondary research, as did our regular coverage of state-of-the-art vendors and emerging companies that are using this technique to develop new commercial offerings.

Definition and Use Cases

If web content can be viewed on a page, it can be scraped. Web scraping, sometimes called web harvesting or web data extraction, is a software technique used to copy or collect selected information from websites. Automatic crawling of websites is widespread, with crawlers (or "robots" or "spiders") extracting varying amounts of data and information.

Some companies use web scraping to harvest publicly available data (such as financial data, weather, news, company information, or sports scores) to enhance existing offerings or build completely new products. Others use it to watch trends, gather competitive intelligence and product information, monitor changes in websites, or gain strategic insight. Web search engines use a similar technique to index websites, capturing links and short snippets of information in order to catalogue the data and provide accurate search results.

[Figure: Websites With HTML Pages; Web Scraping; Structured]

Sample Use Cases

- Real Estate: Scraping housing sale or rental data from online real estate listings in order to track pricing trends over time, calculate average prices, or uncover indicators useful to real estate agents, builders, investors, etc.
- Legislative or Regulatory: Scraping government or regulatory agency websites to discover, understand, or monitor changes or developments in legislation or regulations.
- Health Care: Scraping healthcare forums and blogs to gather conversations or comments about particular drugs, medical devices, or treatments in order to monitor brand reputation, adverse effects, or other clinical information.
- Law Enforcement: Identifying patterns that ensure compliance or help ferret out illegal activity.
- Lead Generation or Company Tracking: Maintaining a near real-time understanding of companies' activities, officers, contact information, and other critical details about products and services that are aggregated vertically or for broad multi-industry use.
- Public Libraries: Archiving and preserving local or commercial history.
- University Libraries: Archiving course syllabi and campus activities.
- Internet Archives: Compiling free books, movies, music, software, etc., or simply archiving websites to compile a history of the web.

Numerous illicit applications, including those used for hacking, also employ web scraping. Corporate espionage, theft of intellectual property, wholesale copying of websites, and degrading website performance are all examples of scraping with an intent to cause damage or loss.

Legal Considerations

The legal landscape around web scraping is still evolving. Under certain circumstances, outlined below, US courts have found that the activity is legal and something that website operators should expect. The activity touches on many of the most sensitive legal and political areas of the digital era:

- The ability to protect and enforce copyright when technology makes copying and sharing so easy.
- The power of Google, Facebook, and others as they become not just pointers linking to content but the home of that content.
- The ownership of user-generated content shared on social platforms.
- The protection of personal data.

By some estimates there have been fewer than 100 cases involving web crawling in the US. One of the reasons is that plaintiffs have been unlikely to win such cases, and it's hard to obtain compensation even in the event of a favorable judgment. Website owners are therefore reluctant to incur the costs of legal action even if a web crawler is clearly violating the terms of access to their site.

There are several theories of infringement or liability under which owners of websites have brought legal action against web scrapers. However, the case law is still very much in flux. The most prominent cases involved one or more of the following:

Breach of Terms and Conditions

Many websites post terms and conditions (T&Cs) prominently on their pages, addressing the issue of access to their website via scrapers. This is intended to create breach of contract liability by establishing a contract between the website owner and the scraper. But posting T&Cs (also known as a "browse-wrap") may not be enough to show that a scraper has breached the terms of the website, since there is no active acceptance on the part of the scraper. What appears to be more enforceable is the use of a "click-wrap", in which the web scraper has to actively click to accept the T&Cs, meaning there is proof of acceptance.

Copyright or Trademark Infringement

In the US, the legal doctrine of fair use allows limited use of copyrighted material under certain conditions without the explicit permission of the copyright holder. Uses for such purposes as parody, criticism, commentary, or academic research are regarded as fair use. Legal precedent and the 1976 Copyright Act set out four specific factors to determine whether use of copyrighted material by scraping or other means is fair use, though courts may also take other considerations into account. Those four main factors are:

- The purpose of the use, including whether it is for personal or commercial use, and the degree to which it is transformative in some way.
- The nature of the copyrighted work.
- The volume of the material scraped in relation to the copyrighted work as a whole.
- The effect of the use on the potential market for or value of the copyrighted work.

Computer Fraud and Abuse Act (CFAA)

There are several federal and state laws against hacking or accessing another's computer. Under the CFAA, whoever intentionally accesses a computer without authorization and, as a result of such conduct, recklessly causes damage is in violation. A claim is stronger if the owner of the affected website can prove loss or damages.

Trespass to Chattels

This term refers to a civil wrong in which one entity has interfered with another's personal property in a way that causes loss of value or damages.

Robots Exclusion Protocol

This is an industry-standard convention that allows a website to publish a robots.txt file containing instructions for web crawlers, indicating which crawlers can access the site and which pages they can access. While a common protocol, it has limited legal value, mainly as supporting evidence in cases involving breach of terms and conditions.
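As an illustration of how a crawler might honor the Robots Exclusion Protocol described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the bot names, and the example.com URLs are hypothetical and not taken from any real site. A real crawler would download the file from the target site before requesting anything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "ExampleResearchBot" is an assumed crawler name for illustration only.
print(is_allowed(parser, "ExampleResearchBot", "https://example.com/listings/1"))
print(is_allowed(parser, "ExampleResearchBot", "https://example.com/private/report.html"))
print(parser.crawl_delay("ExampleResearchBot"))
```

Checking the rules before every request, and keeping a log of those checks, also gives a scraping team a record that it respected the site's stated preferences.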

The situation in the European Union is governed by different legislation and legal systems, but many of the same principles apply, for example relating to terms and conditions. Website and database owners have tended to rely on copyright infringement claims against screen scrapers (there are multiple EU directives on copyright), but there has been little case law to provide guidance. Some key provisions are:

- The EU Database Directive of 1996 provides legal protection for the creators of databases that are not covered by intellectual property rights, so that elements of a database that are not the author's own original creation are protected. In particular, it provides protection where there has been qualitatively and/or quantitatively a substantial investment in either the obtaining, verification, or presentation of the contents.
- A 2015 ruling by the European Court of Justice in a case concerning the airline Ryanair greatly strengthened the ability of website operators to protect their content through contractual terms and conditions (see below) when they are not covered by the Database Directive.

In addition to intellectual property rights infringement, in theory website owners have other legal arguments against web scraping. As in the US, in the UK a website owner could try to bring a common law tort claim for trespass to chattels. A website owner in the UK could also seek to rely on the Computer Misuse Act 1990, which prohibits unauthorized access to, or modification of, computer material. However, unlike in the US, neither of these arguments has been tested in the UK courts specifically in relation to website scraping.

Selected Legal Cases

While theories of liability are still developing, there is some evidence that US courts are leaning towards protecting proprietary content, especially when there is some indication of damages suffered by site owners. On the other hand, vast amounts of web scraping and content harvesting take place every day and never face challenges. The use of big data and open data is not unregulated, but it breeds a culture of "find and re-use" that can become pervasive. Circumstances vary in each case and depend on how the scraper accessed content, what the scraper harvested, and the terms of use presented by the website owner. The following are a sample of issues and resolutions illustrating the range of liabilities.

Associated Press vs. Meltwater

This high-profile case was decided for Associated Press (AP). The media-monitoring firm Meltwater had been crawling AP websites and had extracted and republished significant amounts of text from AP news articles without adding any commentary or insight. Meltwater's stance was that it was operating under the fair use provisions of copyright law. Although the court found in favor of AP, the firms settled the case and entered into a partnership arrangement to develop new products.

Ticketmaster vs. Riedel Marketing Group

Riedel Marketing Group (RMG) used web robots to saturate the Ticketmaster site so that it could harvest large quantities of desirable tickets for resale. Ticketmaster argued that RMG had agreed to the terms and conditions of the site but ignored them. Ticketmaster obtained a preliminary injunction against RMG, and the court held that RMG had infringed on Ticketmaster's copyrighted material.

eBay vs. Bidder's Edge

Bidder's Edge used a web scraper to obtain details of active eBay auctions and reposted them on its own site. In addition, it crawled eBay's site more than 100,000 times per day, which hindered eBay's site performance. Bidder's Edge ignored the robots.txt files as well as a cease and desist letter from eBay, and eBay brought a trespass to chattels claim against Bidder's Edge. eBay successfully obtained a preliminary injunction, since the court decided that the Bidder's Edge spiders interfered with the capacity of eBay's servers.

QVC vs. Resultly

QVC's terms and conditions did not prohibit web scraping, but Resultly was so aggressive with its scraping that QVC's servers became overburdened, and customers were unable to complete transactions. QVC alleged a loss of $2 million and sued Resultly under the Computer Fraud and Abuse Act (CFAA), seeking a preliminary injunction. The court ruled that the plaintiff had to prove that the defendant knowingly and intentionally sought to cause damage to the plaintiff's computer. In this case the court found that Resultly did not intend to cause damage, so QVC was unable to show that Resultly violated the CFAA.

Craigslist vs. Naturemarket

Because of the volume of its listings and how frequently they change, Craigslist has been the target of numerous scrapers over the years, and it has successfully sued or stopped numerous organizations from scraping data. In one case, Naturemarket sold software that made it easy for customers to automatically post listings to Craigslist, and it also posted listings on behalf of customers. Naturemarket also scraped email addresses from Craigslist's site. Craigslist sued, alleging copyright infringement, violation of the Computer Fraud and Abuse Act, and breach of terms of use. Naturemarket failed to contest the suit, and the court awarded Craigslist a judgment of $1.3 million.

Authors Guild vs. Google

A related case, though not directly concerning web scraping. The US Court of Appeals for the Second Circuit held that Google's scanning of millions of books is fair use, even though the books have copyright protection, because of the transformative nature of the use. The court also confirmed that facts are not protected by copyright, suggesting that harvesting factual data from a website is not of itself an infringement.

Ryanair Ltd vs. PR Aviation BV

The most important EU case. In its 2015 judgment, the European Court of Justice ruled that owners of publicly available databases that do not fall under the protection of the Database Directive are free to restrict the use of the data through contractual terms on their website. The case was brought by the budget airline Ryanair against PR Aviation, a travel website operator that extracted data from Ryanair's website in order to compare prices and book flights on payment of a commission. Ryanair required anyone accessing flight data on its website to tick a box to accept its terms and conditions, which included a prohibition on the automated extraction of data from the website for commercial purposes without the airline's permission.

The court found that the Database Directive did not protect Ryanair, but ironically the airline has gained greater protection as a result, because it is free to create contractual limits on the use of its database. Businesses that carry out screen scraping activities in the EU are consequently at an increased risk of being sued for breach of contract by database owners. All a website or database owner needs to do to lawfully protect their data is to include terms and conditions on their website that prohibit screen scraping and to require users to accept the terms before using the part of the site where the relevant data is located.

Essential Actions and Best Practices

Web scraping can be a useful and constructive tool for gathering unstructured information to use for insights on multiple levels. While the legal landscape remains unsettled, and uncertainties abound about what is permissible with web scraping, in Outsell's opinion there are legal, ethical, and social boundaries that are important to maintain. Companies that conduct scraping activities must carefully consider the consequences of their actions. As in many grey areas of business, it's as much a question of managing the risk as of knowing what may or may not turn out to be legal. The key question often becomes "How likely are we to be sued, and how big is the risk?"

The following best practices can limit risk or liability when conducting web scraping:

- Comply with terms and conditions and robots.txt: If there is an advisory in the form of a robots.txt file, or there are terms and conditions that permit, limit, or prohibit web scraping, it's best to stay in compliance and not breach the implied contract.
- Don't overburden the website: Make sure that excessive queries don't interfere with a website's normal processes, slow a site's performance, or cause unintended harm to the website operator.
- Copy as little as possible: Avoid making complete copies of web pages or design elements from the site. Focus on factual information such as pricing data, dates, locations, etc. When it comes to non-factual information, store it only as long as required for analysis. One company that we interviewed has a policy of keeping non-factual, competitive information for only 30 days, during which it analyzes the information, and then discarding it.
- Don't plagiarize: Don't take information from a site and redistribute it or use it verbatim. Transform it, analyze it, or use it in different ways than the original site.
- Avoid crawling or storing sensitive data: Certain types of non-public information or personally identifiable information, such as names, birth dates, or email addresses, are off-limits.
- Don't compete: Avoid using scraped data to offer a competing product or service, which can potentially cause damage to the scraped website.
- Be transparent: Scraping technology makes it possible to put a company's contact details in the scraper's header to identify who is doing the scraping. Using technical trickery to mask IP addresses, bypass CAPTCHAs, or otherwise conceal or filter activities is off-limits. Don't misrepresent your identity or purpose, or use deception in order to gain access.
- Use an API if it's there: Many websites offer the ability to download data via an API, either free or for a fee. If that is the case, it's preferable to obtain the data that way rather than by scraping.
- Obey cease and desist requests: If a website operator wants the scraping to stop, acknowledge the request and stop. In the long run, it's better to adhere to the request than to risk liability.
- Keep up to date: The legal landscape is still in its formative stages and there are no clear-cut patterns in the decisions. In addition, intellectual property laws are not uniform among various jurisdictions or countries. It is important to keep abreast of the developing case law centered on web scraping to mitigate risk.
- Develop a written company policy: Web scraping is a normal and expected business activity. It's not possible to eliminate risk altogether, but establishing and adhering to clear guidelines for web scraping activities is critical from both a legal and an ethical standpoint.
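Three of the practices above (identifying the scraper in its header, not overburdening the site, and discarding non-factual material after a fixed period) can be sketched in code. The following Python sketch uses only the standard library. The bot name, the contact details, and the two-second interval are assumptions chosen for illustration, and the 30-day default simply mirrors the interviewed company's policy rather than any legal requirement.

```python
import time
import urllib.request
from datetime import datetime, timedelta

# Hypothetical bot name, info page, and contact address; replace with real details.
CONTACT_USER_AGENT = (
    "ExampleResearchBot/1.0 (+https://example.com/bot-info; scraping-contact@example.com)"
)


def build_request(url: str) -> urllib.request.Request:
    """Build a request whose User-Agent header says who is doing the scraping."""
    return urllib.request.Request(url, headers={"User-Agent": CONTACT_USER_AGENT})


class Throttle:
    """Enforce a minimum pause between consecutive requests to one site."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_seconds
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self) -> None:
        """Block until at least min_interval_seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def is_expired(collected_at: datetime, now: datetime, retention_days: int = 30) -> bool:
    """Return True once a stored item is older than the retention period."""
    return now - collected_at > timedelta(days=retention_days)


# Assumed interval of two seconds between requests to the same site.
throttle = Throttle(min_interval_seconds=2.0)
```

In a crawl loop, one would call `throttle.wait()` before each `urllib.request.urlopen(build_request(url))` call and periodically delete stored items for which `is_expired` returns True.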

Related Research

Reports
- Text and Data Mining: Technologies Under Construction, January 1, 2016
- Data Business Fundamentals, November 24, 2015
- Evaluating Automated Content Tools, August 13, 2015

Insights
- Researcher Open Data Practices More Prevalent Than Previously Thought, November 21, 2016
- European Commission Proposes Controversial Changes to EU Copyright Laws, October 26, 2016
- Fair Use in Copyright Law Open to Exploitation, October 25, 2015

ABOUT THE AUTHOR

Simon Alterman
VP & Lead Analyst, Manager of Outsell's Leadership Community
+44 (0)20 7419 2352
salterman@outsellinc.com

ABOUT OUTSELL

The rapid convergence of information, media, technology, and data is reshaping businesses every day. Enter Outsell, Inc., the only research and advisory firm focusing on these four sectors. As the trusted advisor to executives, our analysts turn complexity into clarity, and provide the facts and insights necessary to make the right decisions. Our proven blend of big data, research, proprietary intelligence, and exclusive leadership communities produces tangible results and a strong ROI. We promise to deliver wow and ensure our clients stay more focused, save time, and grow revenue in a fast-changing digital world.

www.outsellinc.com
contact_us@outsellinc.com
Burlingame, CA USA +1 650-342-6060
London, United Kingdom +44 (0)20 8090 6590

Outsell, Inc. is the sole and exclusive owner of all copyrights and content in this report. As a user of this report, you acknowledge that you are a licensee of Outsell's copyrights and that Outsell, Inc. retains title to all Outsell copyrights in the report. You may use this report only within your own work group in your company. For broader distribution right options, please email us at info@outsellinc.com.

The information, analysis, and opinions (the "Content") contained herein are based on the qualitative and quantitative research methods of Outsell, Inc. and its staff's extensive professional expertise in the industry. Outsell has used its best efforts and judgment in the compilation and presentation of the Content and to ensure to the best of its ability that the Content is accurate as of the date published. However, the industry information covered by this report is subject to rapid change. Outsell makes no representations or warranties, express or implied, concerning or relating to the accuracy of the Content in this report, and Outsell assumes no liability related to claims concerning the Content of this report.

Advancing the Business of Information
© 2016 Outsell, Inc.