1 1 WEB SCRAPING LEGAL ISSUES AND BEST PRACTICES FOR PUBLISHERS CEO Topics December 9, 2016
2 2 Table of Contents Why This Topic 3 Methodology 3 Definition and Use Cases 4 Sample Use Cases 4 Legal Considerations 5 Selected Legal Cases 6 Essential Actions 7
3 3 Why This Topic Web scraping is an important tool for information businesses, both for building products and for internal uses such as competitive intelligence. The volume of information available on the open web and the ever-increasing sophistication of web scraping technologies allows organizations of all sizes to gather large quantities of rich data for analytics purposes. The sources of scraped material are myriad, ranging from company websites to social media to government data sets, as are the reasons for conducting web scraping. Web scraping activity often sets the interests of a website owner against those who want to harvest the data. Owners usually want to control, profit from, or leverage the data they have on their website. Those who collect data may want to use it for building new data-driven information solutions of their own, or for internal uses such as competitive intelligence, analysis reasons often completely unconnected to the original purpose of publication on the site. As the use of big data analytics to find patterns and trends hidden in unstructured or seemingly unrelated data sets continues to grow, the degree and volume of web scraping also continues to increase. Where many of the first court cases involving web scraping centered around simple copying, republishing, or linking to scraped data for competitive purposes, the use of data analytics tools has broadened the playing field of possible uses of scraped material, and it is likely to make interpretations of liability more nuanced and driven by the circumstance of each specific case. Outsell examined the use of web scraping and profiled leading suppliers of such services in our 2015 report Evaluating Automated Content Tools. In this report, we look at the definition of web scraping and some use cases in current practice, followed by a review of the legal considerations and theories of liability by which website owners have tried to sue or prevent web scraping activity. We also present some sample cases illustrating those theories. Finally, we look at some best practices for companies that use or are considering the use of web scraping as part of their business intelligence or product development practices. It s essential for leaders to consider this evolving mechanism for driving data-driven solutions and be clear on the current state of potential liabilities and best practices. Methodology Primary research for this report comprised a series of 10 interviews with information managers, business intelligence researchers, and legal practitioners. We supplemented those interviews with secondary research from published reports, industry blogs, mainstream press, and blogs. Outsell s ongoing dialogue with the market added depth to the primary and secondary research, as did our regular coverage of state-of-the-art vendors and emerging companies who are using this technique to develop new commercial offerings.
4 4 Definition and Use Cases If web content can be viewed on a page, it can be scraped. Web scraping, sometimes called web harvesting or web data extraction, is a software technique used to copy or collect selected information from websites. Automatic crawling of websites is widespread, with crawlers (or robots or spiders ) extracting varying amounts of data and information. Some companies use web scraping to harvest publicly available data (such as financial data, weather, news, company information, or sports scores) to enhance existing offerings or build completely new products. Others use it to watch trends, gather competitive intelligence and product information, monitor changes in websites, or gain strategic insight. Web search engines use a similar technique to index websites, capturing links and short snippets of information, in order to catalogue the data and provide accurate search results. Websites With HTML Pages Web Scraping Structured Sample Use Cases Real Estate: Scraping housing sale or rental data from online real estate listings in order to track pricing trends over time, average prices, or uncover indicators useful to real estate agents, builders, investors, etc. Legislative or Regulatory: Scraping government or regulatory agency websites to discover, understand, or monitor changes or developments with legislation or regulations. Health Care: Scraping healthcare forums and blogs to gather conversations or comments about particular drugs, medical devices, or treatments in order to monitor brand reputation, adverse effects, or other clinical information. Law Enforcement: Identifying patterns that ensure compliance or help ferret out illegal activity. Lead Generation or company tracking: maintaining near real-time understanding of companies activities, officers, contact information and other critical details about products and services that are aggregated vertically or for broad multi-industry use. Public libraries: Archiving and preserving local or commercial history. University libraries: Archiving course syllabi and campus activities. Internet archives: Compiling free books, movies, music, software, etc., or simply archiving website to compile a history of the web. Numerous illicit applications and those used for hacking also employ web scraping. Corporate espionage, theft of intellectual property, wholesale copying of websites, or impacting website performance are all examples of employing scraping with an intent to cause damage or loss.
5 5 Legal Considerations The legal landscape around web scraping is still evolving. Under certain circumstances, outlined below, US courts have found that the activity is legal and something that website operators should expect. The activity touches on many of the most sensitive legal and political areas of the digital era: The ability to protect and enforce copyright when technology makes copying and sharing so easy. The power of Google, Facebook, and others as they become not just pointers linking to content but the home of that content. The ownership of user generated content shared on social platforms. The protection of personal data. By some estimates there have been fewer than 100 cases involving web crawling in the US. One of the reasons is that plaintiffs have been unlikely to win such cases, and it s hard to obtain compensation in the event of a favorable judgment. Therefore, website owners are reluctant to incur the costs of legal action even if there is clearly a web crawler violating the terms of access to their site. There are several theories of infringement or liability by which owners of websites have brought legal action against web scrapers. However, the case law is still very much in flux. The most prominent cases involved one or more of the following: Breach of Terms and Conditions Many websites post terms and conditions (T&Cs) prominently on their pages, addressing the issue of access to their website via scrapers. This is intended to create breach of contract liability by establishing a contract between the website owner and the scraper. But posting T&Cs (also known as a browse-wrap ) may not be enough to show that a scraper has breached the terms of the website, since there is no active acceptance on the part of the scraper. What appears to be more enforceable is the use of a click-wrap in which the web scraper has to actively click to accept the T&Cs, meaning there is proof of acceptance of the T&Cs. Copyright or Trademark Infringement In the US, the legal doctrine of fair use allows limited use of copyrighted material under certain conditions without the explicit permission of the copyright holder. Uses for such purposes as parody, criticism, commentary, or academic research are regarded as fair use. Legal precedent and the 1976 Copyright Act set out four specific factors to determine whether use of copyright material by scraping or other means is fair use, though they may also take other considerations into account. Those four main factors are: The purpose of the use, including whether it is for personal or commercial use, and the degree to which it is transformative in some way. The nature of the copyrighted work. The volume of the material scraped in relation to the copyrighted work as a whole. The effect of the use on the potential market for or value of the copyrighted work. Computer Fraud and Abuse Act (CFAA) There are several federal and state laws against hacking or accessing another s computer. The CFAA states that whoever intentionally accesses a computer without authorization and as a result of such conduct recklessly causes damage is basically in violation, especially if the violated website can prove loss or damages. Trespass to Chattels This is a term referring to a civil wrong which means one entity has interfered with another s personal property which causes loss of value or damages. Robots Exclusion Protocol This is an industry standard program that allows a website to embed robots.txt files within their websites that communicate instructions to web crawlers to indicate which crawlers can access the site, and which pages they can access. While a common protocol, it has limited legal value, mainly as supporting evidence in cases involving breach of terms and conditionshas limited legal value, mainly as supporting evidence in cases involving breach of terms and conditions.
8 8 Essential Actions and Best Practices Web scraping can be a useful and constructive tool for gathering unstructured information to use for insights on multiple levels. While the legal landscape remains unsettled, and uncertainties abound in terms of what is permissible with web scraping, in Outsell s opinion there are legal, ethical, and social boundaries that are important to maintain. Companies that conduct scraping activities must carefully consider the consequences of their actions. As in many grey areas of business, it s as much a question of managing the risk as of knowing what may or may not turn out to be legal. The key question often becomes How likely are we to be sued and how big is the risk? The following items are best practices that can limit risk or liability when conducting web scraping: Comply with terms and conditions and robots.txt: If there is an advisory in the form of a robots.txt file, or terms and conditions specified which permits, limits, or prohibits web scraping, it s best to stay in compliance and not breach the implied contract. Don t overburden the website: Make sure that excessive queries don t interfere with a website s normal processes, slow a site s performance, or cause unintended harm to the website operator. Copy as little as possible: Avoid making complete copies of web pages or design elements from the site. Focus on factual information such as pricing data, dates, locations, etc. When it comes to non-factual information, store it only as long as required for analysis. One company that we interviewed has a policy of keeping non-factual, competitive information for only 30 days, during which they analyze it, and then discard it. Don t plagiarize: Don t take information from a site and redistribute it or use it verbatim. Transform it, analyze it, or use it in different ways than the original site. Avoid crawling or storing sensitive data: Certain types of non-public information or personally identifiable information, such as names, birth dates, or addresses, are off-limits. Don t compete: Avoid using scraped data to offer a competing product or service, which can potentially cause damage to the scraped website. Be transparent: Scraping technology is such that it s possible to put a company s contact details in the scraper s header to identify who is doing the scraping. Using technical trickery to mask IP addresses, bypass CAPTCHAs, or otherwise conceal or filter activities is off limits. Don t misrepresent one s identity or purpose, or use deception in order to gain access. Use an API if it s there: Many websites offer the ability to download data via an API, either free or for a fee. If that is the case, it s preferable to obtain the data that way, rather than by scraping. Obey cease and desist requests: If a website operator wants the scraping to stop, acknowledge and stop. In the long run, it s better to adhere to the request than to risk liability. Keep up-to-date: The legal landscape is still in its formative stages and there are no clear-cut patterns in the decisions. In addition, intellectual property laws are not uniform among various jurisdictions or countries. It is important to keep abreast of the developing case law centered on web scraping to mitigate risk. Develop a written company policy: Web scraping is a normal and expected business activity. It s not possible to eliminate risk altogether, but establishing and adhering to clear guidelines for web scraping activities is critical from both a legal and ethical standpoint.
9 9 Related Research Reports Text and Data Mining: Technologies Under Construction January 1, 2016 Data Business Fundamentals Novemeber 24, 2015 Evaluating Automated Content Tools August 13, 2015 Insights Researcher Open Data Practices More Prevalent Than Previously Thought Novemeber 21, 2016 European Commission Proposes Controversial Changes to EU Copyright Laws October 26, 2016 Fair Use in Copyright Law Open to Exploitation October 25, 2015
10 10 ABOUT THE AUTHOR Simon Alterman VP & Lead Analyst, Manager of Outsell s Leadership Community +44 (0) See additional reports published in all coverage areas. Does this report meet your needs? Send us your feedback. ABOUT OUTSELL The rapid convergence of information, media, technology, and data is reshaping businesses every day. Enter Outsell, Inc., the only research and advisory firm focusing on these four sectors. As the trusted advisor to executives, our analysts turn complexity into clarity, and provide the facts and insights necessary to make the right decisions. Our proven blend of big data, research, proprietary intelligence, and exclusive leadership communities produces tangible results and a strong ROI. We promise to deliver wow and ensure our clients stay more focused, save time, and grow revenue in a fast-changing digital world. Burlingame, CA USA London, United Kingdom +44 (0) Outsell, Inc. is the sole and exclusive owner of all copyrights and content in this report. As a user of this report, you acknowl- edge that you are a licensee of Outsell s copyrights and that Outsell, Inc. retains title to all Outsell copyrights in the report. You may use this report, only within your own work group in your company. For broader distribution right options, please us at The information, analysis, and opinions (the Content ) contained herein are based on the qualitative and quantitative research methods of Outsell, Inc. and its staff s extensive professional expertise in the industry. Outsell has used its best ef- forts and judgment in the compilation and presentation of the Content and to ensure to the best of its ability that the Content is accurate as of the date published. However, the industry information covered by this report is subject to rapid change. Outsell makes no representations or warranties, express or implied, concerning or relating to the accuracy of the Content in this report and Outsell assumes no liability related to claims concerning the Content of this report. Advancing the Business of Information 2016 Outsell, Inc.
Acceptance of Terms Last Updated: January 24, 2014 Terms of Service Please read this Terms of Service Agreement carefully. MedicaidInsuranceBenefits.com ("MedicaidInsuranceBenefits.com," "our," "us") provides
These terms of service (the "Terms") govern your access to and use of the Online File Storage ("OFS") websites and services (the "Service"). The Terms are between DigitalMailer, Incorporated and Digital
Terms and conditions of use 1. Introduction 1.1 These terms and conditions govern your use of our website. 1.2 By using our website, you accept these terms and conditions in full; accordingly, if you disagree
ETPL Extract, Transform, Predict and Load An Oracle White Paper March 2006 ETPL Extract, Transform, Predict and Load. Executive summary... 2 Why Extract, transform, predict and load?... 4 Basic requirements
Beyond Market Research A New Look At Defining The Insights Sector Harry Henry VP & Lead Analyst IIeX 2014 June, 2014 Atlanta, GA A d v a n c i n g t h e B u s i n e s s o f I n f o Ar md va at ino cn i
Hibbett Sports Messaging Service (SMS) Terms and Conditions This document provides the Terms and Conditions for the Hibbett Sports Messaging Service (SMS) (hereafter, the Service ). By using the Service
Terms and Conditions Agreement between user and internetsecurityservices.org Welcome to internetsecurityservices.org. The internetsecurityservices.org website (the "Site") is comprised of various web pages
LICENSE AGREEMENT FOR TOBII ANALYTICS SOFTWARE DEVELOPMENT KIT AND API PREAMBLE This Tobii Analytics Software Development Kit and API License Agreement (the "Agreement") forms a legally binding contract
Legal Terms and Conditions of www.loewecraftprize.com The website http://www.loewecraftprize.com is owned and managed by LOEWE, (with headquarters in Spain, Madrid, calle Goya 4, 28001, and tax ID A-28003861)
Terms Of Service BY USING THE COMPANY'S SERVICES YOU AGREE TO BE BOUND BY ITS TERMS AND CONDITIONS. 1. Definitions. "The Company" means CIT Broadband, P.O. Box 122568, Fort Worth, TX 76121. "The Subscriber"
Internet Reputation Management Guidelines Building a Roadmap for Continued Success Table of Contents Page INTERNET REPUTATION MANAGEMENT GUIDELINES 1. Background 3 2. Reputation Management Roadmap 5 3.
FAX-TO-EMAIL END-USER LICENSE AGREEMENT This Agreement, which governs the terms and conditions of your use of the Fax-to-Email Services, is between you ("you" or "End-User") and ( we, us, our or Company
Internet Services Terms and Conditions 1. These terms and conditions These General Terms and Conditions apply to you if you are a business or residential telecommunications customer of Telnet Telecommunication
SMALL BUSINESS OWNERS and THE CANADIAN LEGAL SYSTEM Understanding small business experience and attitudes toward legal disputes Helping small business owners understand and protect themselves from unexpected
Visit Lake Norman Lake Norman Convention & Visitors Bureau 19900 West Catawba Avenue, Suite 102 Cornelius, North Carolina 28031 704-987-3300 visitlakenorman.org TERMS AND CONDITIONS Visit Lake Norman (Lake
Wakefield Public Schools Technology Acceptable Use Policy Vision (per Technology Strategic Plan 2013-2016) Wakefield Public Schools considers classroom technology to be an essential, imperative part of
Background The Fair Debt Collection Practices Act (FDCPA) (15 USC 1692 et seq.), which became effective in March 1978, was designed to eliminate abusive, deceptive, and unfair debt collection practices.
Aitoc Software LLC License Agreement for Magento Extensions 1. General This is a legal Agreement between Aitoc Software LLC and the Customer that covers the Customer s purchase and use of Extensions for
Philips Secure Data Transfer Terms of Service th Revised: May 10, 2012 Thank you for using Philips Secure Data Transfer. These terms of service (the Terms ) govern your access to and use of Philips Secure
Rethinking Schools Limited Institutional Site License This License Agreement ( License ) is entered into the day of [20 ] ( Effective Date ) between Rethinking Schools Limited, a Wisconsin Corporation,
ELITEPAY TERMS AND CONDITIONS OF SERVICE EFFECTIVE: November 15, 2014 These terms and conditions of service ("Terms of Service") apply to your use of this ElitePay payment website (the "Website") and all
WEBSITE DISCLAIMER No warranties This website is provided as is without any representations or warranties, express or implied. The Gallery Tattoo Studio makes no representations or warranties in relation
THE BUSINESS COUNCIL OF WESTCHESTER Website & Internet Services Terms And Conditions of Use PLEASE READ THE FOLLOWING TERMS AND CONDITIONS RELATING TO YOUR USE OF OUR WEBSITE AND ANY OTHER INTERNET-BASED
SOFTWARE LICENSE AND NON-DISCLOSURE AGREEMENT This SOFTWARE LICENSE AND NON-DISCLOSURE AGREEMENT ( Agreement ) is between Drake Software, LLC ( Drake ) and Licensee (as defined below). PLEASE READ THIS
Website & Email Hosting Terms & Conditions 1-PARTIES Web Hosting Services are provided by TimeForCake Creative Media, Inc. ("TimeForCake") to Client conditional on the terms and conditions set forth below
Our Terms for Website Design, Development, Hosting and Promotional Services Terms of agreement by and between: Mad About U Media Ltd. ( PROVIDER ) and the ( CLIENT ). The CLIENT being the remitter of funding
SUPPLY AGREEMENT This Agreement ("Agreement") is entered into as of ("Effective Date"), between Nutratech, Inc., a New Jersey corporation having a place of business at 10 Henderson Drive, West Caldwell,
Wealth Training Pty Ltd - Right to Cancel Notice Wealth Training Pty Ltd Right to Cancel; Right of Rescission Purchaser may cancel this transaction, without penalty or obligation, at any time within the
Terms & conditions for the use of this Website IMPORTANT IT IS DEAMED THAT YOU HAVE READ AND AGREE TO ALL TERMS & CONDITIONS BEFORE USING THIS WEBSITE. By using this website you are deemed to have full
TERMS AND CONDITIONS INFLUENCERS AT WORK These TERMS AND CONDICTIONS (this Agreement ) are agreed to between InfluencersAtWork, Ltd. ( InfluencerAtWork ) and you, or if you represent a company or other
TaxSaverNetwork Terms of Service 1031 Exchange Advantage (1031 EA, LLC) provides 1031 accommodation services for investors to conclude a 1031 exchange under IRC code regulations. 1031 EA, LLC structures,
NOTICE: ACCESS OR USE OF ADHA s National Board Exam Review Course IS SUBJECT TO YOUR ACCEPTANCE OF THESE TERMS AND CONDITIONS. IF YOU DO NOT AGREE TO THESE TERMS AND CONDITIONS, YOU ARE NOT PERMITTED TO
London LAWN Terms of Service 1. GENERAL This WiFi Service is an Internet access service provided by Downtown London in partnership with Turnstyle Solutions which provides you with access to the Internet
An Introduction to the Legal Issues Surrounding Open Source Software By Daliah Saper Saper Law Offices, LLC 505 N. LaSalle, Suite #350 Chicago, IL 60654 http://www.saperlaw.com Open Source Software Open
User Agreement Quality. Value. Efficiency. Welcome to QVuE, the Leaders Network on Quality, Value and Efficiency website sponsored by The Medicines Company. The information provided in this Webinar Series
MAGUSA LOGISTICS WEBSITE TERMS AND CONDITIONS Introduction These terms and conditions govern your use of this website; by using this website, you accept these terms and conditions in full. If you disagree
Fair Debt Collection Practices Act Introduction The Fair Debt Collection Practices Act (FDCPA), effective in 1978, was designed to eliminate abusive, deceptive, and unfair debt collection practices. The
Terms and Conditions 1. About Us 1.1 www.phonefinder.co.za ("the Website") Phonefinder is an online cellular lead generation website, which enables users ("you, your") to enter their contact information
1.0 Overview The purpose of this Policy is to detail the University s plans to effectively combat the unauthorized distribution of copyrighted material by users of the Information Technology Resources,
GestInTime GESTINTIME SERVICE SaaS End- User License Agreement (EULA) IMPORTANT - READ CAREFULLY: This GESTINTIME SERVICE End- User License Agreement ("EULA") is a legal agreement between you (either an
UNITED STATES DISTRICT COURT DISTRICT OF MASSACHUSETTS IATRIC SYSTEMS, INC., Plaintiff, Civil Action No. 1:14-cv-13121 v. FAIRWARNING, INC., JURY TRIAL DEMANDED Defendant. COMPLAINT Iatric Systems, Inc.
End User License Agreement 1. PARTIES This Agreement is by and between KM NETWORK SDN. BHD ( 719624 T), a registered company in Malaysia, Address: 20 1, JALAN 24/70A, DESA SRI HARTAMAS, KUALA LUMPUR, MALAYSIA,
NPSA GENERAL PROVISIONS 1. Independent Contractor. A. It is understood and agreed that CONTRACTOR (including CONTRACTOR s employees) is an independent contractor and that no relationship of employer-employee
Data Subscription Service Data descriptions Order form Licence agreement Introduction Background The General Pharmaceutical Council (GPhC) is the regulator for pharmacists, pharmacy technicians and registered
Direct Marketing Rules Is your business compliant? June 2016 Our expertise Banking & Finance Charities Commercial Construction Corporate Corporate Tax Disputes Employment Family & Matrimonial Immigration
Introduction This web site and the related web sites contained herein (collectively, the Site ) make available information on hotels, resorts, and other transient stay facilities (each a Property ) owned,
WIN reserves the right to prioritize traffic based on real time and non-real time applications during heavy congestion periods, based on generally accepted technical measures. WIN sets speed thresholds
Bank Independent Online Financial Management Addendum This Online Financial Management Addendum (this OFM Addendum ) is an addendum to your Online Banking Agreement and Electronic Funds Transfer Act Notice
Automating policy enforcement to prevent endpoint data loss IBM Data Security Services for endpoint data protection endpoint data loss prevention solution Highlights Protecting your business value from
Robinhood Terms & Conditions Robinhood Financial LLC ( Robinhood Financial ), a wholly-owned subsidiary of Robinhood Markets, Inc. ( Robinhood Markets ), is a registered broker-dealer and member of FINRA
EMBEDDING BCM IN THE ORGANIZATION S CULTURE Page 6 AUTHOR: Andy Mason, BSc, MBCS, CITP, MBCI, Head of Business Continuity, PricewaterhouseCoopers LLP ABSTRACT: The concept of embedding business continuity
TERMS and CONDITIONS OF USE - NextSTEPS TM DATED MARCH 24, 2014. These terms and conditions of use (the Terms and Conditions ) govern your use of the website known as NextSTEPS TM, https://www.stepsonline.ca/
Web site Terms and Conditions Welcome to www.discipleshomemissions.org, the Web site of Div. of Homeland Ministries (d.b.a. Disciples Home Missions [DHM]), Indianapolis, Indiana. DHM and its affiliates
Online Study Affiliate Marketing Agreement This Affiliate Marketing Agreement (the "Agreement") contains the complete terms and conditions that apply to your participation as an Affiliate Marketer ("you,"
Terms and conditions of use 1. Introduction 1.1 These terms and conditions govern your use of our website. 1.2 By using our website, you accept these terms and conditions in full; accordingly, if you disagree
Terms and Conditions Terms and Conditions for Membership and Use, between Heritage Matrimonials and the Customer, and any Third Party. PLEASE READ THESE TERMS AND CONDITIONS CAREFULLY. BY USING THE HERITAGE
PAGE 1 of 5 PURPOSE Triton College s computer and information network is a continually growing and changing resource supporting thousands of users and systems. These resources are vital for the fulfillment
INTERNATIONAL MONEY EXPRESS (IME) LIMITED ONLINE REMIT USER AGREEMENT This User Agreement is version 1.01 and is effective from This Agreement Between International Money Express (IME) Limited (hereafter
ELKHART COUNTY BOARD OF REALTORS AND MULTIPLE LISTING SERVICE OF ELKHART COUNTY INC. VIRTUAL OFFICE WEBSITE (VOW) LICENSE AGREEMENT This License Agreement (the Agreement) is made and entered into between
Your Choice Counselling. Website Legal Notice Important - this is a legal agreement between you and Your Choice Counselling. Registered office: 2 Seaford Close, Burseldon, Southampton, Hampshire SO31 8GL
Freelancer Agreement If a Client and a Freelancer enter an independent contractor relationship, then this Freelancer Agreement ( Freelancer Agreement ) will apply. This Agreement is effective as of March
AGREEMENT BETWEEN USER AND International Network of Spinal Cord Injury Nurses The International Network of Spinal Cord Injury Nurses Web Site is comprised of various Web pages operated by International