Combating Fraud In Online Advertising

September 2014

Jason Shaw, Senior Data Scientist, jshaw@integralads.com
Kiril Tsemekhman, Chief Data Officer, kiril@integralads.com

1 Bots

As the Web continues to evolve, automated activity by software programs, colloquially referred to as bots, finds ever greater application. Near the end of 2013, security and content delivery firm Incapsula released a study based on 20,000 sites in its network showing that over 60% of Web traffic is bot-generated [1]. However, while this traffic is a central feature of the Web today, it remains poorly understood by many who have no direct interaction with it. In particular, it is essential to differentiate between good bot activity, which is undertaken with the informed consent of all relevant parties, and bad bot activity, which is carried out without the consent of, and often against the interests of, one or more relevant parties. Incapsula found that about half of all bot activity is good. However, the other half has significant economic ramifications across many industries. In this paper we give a brief summary of the types of activity bots carry out, and specifically how bad bots are affecting the online advertising industry and what can be done about them.

1.1 Good Bots

The prototypical example of a good bot is a web spider. Googlebot, for example, is the spider used by Google to catalog the content of the Internet, providing the raw information for the company's flagship product, its search engine [2]. Bots have also been used in attempts to participate in transactions more efficiently. High-frequency trading is a prominent example of this. Online betting service Betfair is embracing bot traffic as a new channel for bids [3]. Comparison shopping services also use bots, either to scrape a variety of retailer sites for pricing information or to automatically identify good deals on a single site. While there is no universal agreement on the acceptability of this activity [4], it is nevertheless an example of an unconcealed, commercial use of bots.
1.2 Bad Bots

The malicious uses of bots are much more varied, including malware dissemination, information theft, fraud, and denial-of-service attacks. Except for the last, most malicious bot activity is driven by a significant economic incentive. Denial-of-service attacks are meant here to include any activity whose purpose is to disrupt the legitimate business of another, either through the direct action of many bots simultaneously (distributed denial of service) or through corrupting some aspect of the business (data destruction, search engine blacklisting, etc.). One emerging type of bot activity is the manipulation of social network measures of influence through the purchase of Facebook Friends or Twitter Retweets [5]. A prominent example of bot-driven fraud is found in online advertising, and is the subject of the rest of this paper.

2 Fraud In Online Advertising

Online advertising fraud is, fundamentally, a result of misaligned incentives. Advertisers seek the attention of website visitors. Attention, however, is extremely difficult to measure. Instead, proxies are used. In one common case, when an ad server receives an HTTP request for a particular ad's URL, this is noted and the advertiser is charged accordingly. This is known as cost-per-mille (CPM) compensation, so named because prices are stated in terms of the cost for every 1,000 impressions. Similarly, an advertiser may decide that clicks are a more relevant measure of attention, and instead pay according to how many times the target URL of an ad is requested in a cost-per-click (CPC) scheme. Because the measure is no longer user attention but an HTTP request, it is a relatively simple matter for a bot to generate it without the presence of a human, defrauding the advertiser. Other methods of compensation are possible, including cost-per-action (CPA), which is meant to reward only those impressions which lead to a desired outcome (e.g., a purchase or site visit). However, methods of attributing actions to impressions are still extremely primitive and subject to the same manipulations as CPM and CPC schemes.

Online advertising fraud can be found on any ad-supported website and can affect any advertiser. The typical arrangement is for a website to purchase traffic from a supplier which uses bots to fulfill the order. This allows websites to plead ignorance of the true nature of their traffic (since purchasing traffic remains a generally accepted practice), and permits the botnet controller to earn money from a wide variety of websites without needing to personally manage each one. The sites involved range from premium publishers needing to reach a traffic quota, to struggling websites seeking to monetize their content, to websites established for the sole purpose of carrying out ad fraud. In the third case, the sites may indeed be managed by the botnet controller. In addition, publishers who actively avoid practices which could bring bot traffic to their sites may still be vulnerable: in order to appear attractive to advertisers and garner higher bids on fraudulent inventory, bots will sometimes visit premium publishers to collect cookies and establish a profile that will be targeted by ad networks.

* For more information about this whitepaper or Integral's services, please contact Integral Ad Science by calling (646) 278-4871 or sending an inquiry to info@integralads.com.

3 Combating Online Advertising Fraud

A number of approaches exist for dealing with the threat of bot activity defrauding advertisers. Some of the more prominent, along with their shortcomings, are detailed below.

1) Post-event reporting
Approach: Analyze data offline to discover evidence of fraudulent activity.
Report this information to the client. The client uses the information to guide future business decisions, including data targeting and media planning.
Shortcomings: The data is a day late and a dollar short: the client has already paid for the fraudulent impressions. Whatever decisions the client could take to avoid the issue in the future are of limited effect, since the data quickly becomes out of date.

2) Blacklist/whitelist
Approach: Analyze data offline to discover evidence of fraudulent activity. Compile a list of websites with rates of fraud higher than an acceptable threshold. Refuse to serve any ads on these websites.
Shortcomings: This scorched-earth approach certainly avoids fraud where it has been seen before, but by its nature it is not a real-time process. New domains are constantly being registered to commit ad fraud, so a blacklist is always out of date. Additionally, many websites with significant fraudulent volume also have large organic volume, such as premium sites trying to achieve a higher traffic target. Using a blacklist to exclude fraudulent inventory on these sites may eliminate a valuable segment from an advertiser's audience.

3) Post-event reporting
Site-level Approach: Compile historical rates of fraud by web page, enabling clients to target segments with lower historical rates.
User-level Approach: Compile historical rates of fraud by user, enabling clients to exclude specific users from targeting.
Shortcomings: Targeting at the site level carries the same disadvantages as using a blacklist or whitelist. Targeting at the user level is among the most precise means available of avoiding fraudulent inventory. In either case, the relatively slow speed at which RTB third-party data is updated quickly renders the data out of date. An efficient way of updating data in targeting platforms is required to make user-level fraud prevention effective. Standard cookie syncing can help alleviate this issue but adds complexity and uncertainty. More efficient implementations have recently become available from vendors.

4) Revised compensation
Approach: This approach is post-event reporting with the additional arrangement between the advertiser and publisher (or any intermediaries such as ad networks) that any inventory deemed fraudulent should not be billed for.
Shortcomings: Claw-back negotiations can be messy, heated, and time-consuming. In some cases, it may be impossible to claim make-goods, such as when ad networks buy media through RTB platforms.

5) Real-time detection
Approach: Collect information about the impression and the user's environment and leverage historical data for similar impressions to establish whether the impression is fraudulent. If it is, terminate the call to the ad server or disqualify the opportunity from an RTB auction, thus cutting off the flow of money to the fraudsters without engaging in claw-back negotiations.
Shortcomings: Actively interfering with the ad call could provide fraudsters with greater detail about how they are being caught, accelerating their pace of innovation, though such an effect has not yet been observed.
In addition, this solution is not applicable to programmatic buyers who do not have access to the browser itself at bidding time.

6) Causal attribution
Approach: Pay according to a CPA scheme which uses causality as the foundational component of its attribution methodology. Since bots will not make purchases, impressions served to bots will not be paid for as they caused no conversions.
Shortcomings: While this represents an ultimate solution in permanently devaluing impressions delivered to bots, an accepted industry standard for causal attribution is likely years away from adoption. Furthermore, explicit removal of fraudulent impressions may still be needed, and a new conversion fraud front may open as soon as this becomes critical for fraudsters to sustain their business.
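
The blacklist approach in item 2 and its main shortcoming can be illustrated with a minimal sketch. The site names, impression counts, and the 0.25 threshold below are invented for illustration and do not come from any real measurement. The sketch lists sites whose observed fraud rate exceeds the threshold, then counts how many impressions not flagged as fraudulent would be given up by blocking those sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteStats:
    """Offline impression counts for one site (hypothetical data)."""
    site: str
    impressions: int  # total impressions analyzed
    fraudulent: int   # impressions flagged as fraudulent

    @property
    def fraud_rate(self) -> float:
        return self.fraudulent / self.impressions if self.impressions else 0.0


def build_blacklist(stats, threshold):
    """Return the sites whose fraud rate exceeds the threshold, sorted by name."""
    return sorted(s.site for s in stats if s.fraud_rate > threshold)


def organic_volume_lost(stats, blacklist):
    """Count non-fraudulent impressions given up by blocking the listed sites."""
    blocked = set(blacklist)
    return sum(s.impressions - s.fraudulent for s in stats if s.site in blocked)


# Invented example counts; the threshold is an assumption, not a recommendation.
STATS = [
    SiteStats("premium-news.example", 1_000_000, 300_000),
    SiteStats("ghost-site.example", 50_000, 49_000),
    SiteStats("small-blog.example", 20_000, 400),
]
BLACKLIST = build_blacklist(STATS, threshold=0.25)
LOST = organic_volume_lost(STATS, BLACKLIST)
print(BLACKLIST, LOST)
```

In this made-up data the premium site is blocked along with the site that is almost entirely fraudulent, and nearly all of the forfeited organic volume comes from the premium site. This is the trade-off that motivates the user-level and real-time approaches.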

4 Detecting Bot Activity

The information useful in detecting fraudulent activity can be divided into two main classes: session-based and historical. On each impression, JavaScript code or other browser-based software can be deployed to collect a variety of information, including features of the browser environment, detailed viewability measurements, page structure, ad provenance (networks, exchanges, etc. involved in the sale of the inventory), the URL the ad is to be run on, and more. All of this information is reported to a server, together with the standard information provided in IP, TCP, and HTTP headers. These features can be evaluated on their own, as session-based signals, or can be used to place the impression in broader historical patterns.

Both major approaches to bot detection described below are statistical in nature: one impression is almost never sufficient to confidently flag the browser as being controlled by a bot; even after analyzing many impressions, the result is still a probability of the machine being a bot (though this probability can quickly approach 100%). Once the determination is made, subsequent actions are fully deterministic. However, in most cases at least several impressions need to be analyzed, together with a detailed view of broad Internet traffic patterns; the denser the data available to the fraud detection solution, the faster and more reliable this detection is. The most powerful bot detection techniques are aptly described by the colloquial term Big Data.

4.1 Features

4.1.1 Session-based Signals

There are many examples of session-based signals which do not require historical context to interpret. For example, information collected about page structure can indicate an instance of pixel stuffing, in which entire pages are placed inside tiny iframes on a containing page (see, for example, [6]). This may or may not be associated with bots, but it is fraudulent in either case.
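
A pixel-stuffing check of this kind might be sketched as follows. The field names, the idea of comparing the containing frame's size with the nominal ad size, and the 0.5 cut-off are all assumptions made for illustration; real measurement code runs in the browser and gathers far more context than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMeasurement:
    """Sizes in CSS pixels, as reported by hypothetical in-page measurement code."""
    frame_width: int   # rendered width of the iframe holding the page
    frame_height: int  # rendered height of the iframe holding the page
    ad_width: int      # nominal width of the ad creative
    ad_height: int     # nominal height of the ad creative


def looks_pixel_stuffed(m: FrameMeasurement, min_visible_fraction: float = 0.5) -> bool:
    """Flag an impression whose containing frame cannot show most of the ad.

    The frame can display at most min(frame, ad) pixels along each axis, so
    the ratio computed below is an upper bound on the visible share of the ad.
    """
    ad_area = m.ad_width * m.ad_height
    if ad_area <= 0:
        return False  # no usable creative size, so this signal gives no verdict
    visible_w = max(0, min(m.frame_width, m.ad_width))
    visible_h = max(0, min(m.frame_height, m.ad_height))
    return (visible_w * visible_h) / ad_area < min_visible_fraction


# A 300x250 ad inside a 1x1 iframe is flagged; the same ad in a full page is not.
print(looks_pixel_stuffed(FrameMeasurement(1, 1, 300, 250)))
print(looks_pixel_stuffed(FrameMeasurement(1024, 768, 300, 250)))
```
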
Another potential signal is the browser operated by the user. The User-Agent HTTP header makes a claim that the request comes from a particular version of a browser, and this claim can be assessed in multiple ways by fraud detection code. So-called user agent spoofing is a highly reliable indicator of fraudulent activity.

Another technique is to take measurements of the system's performance. Often a bot will be running on an older, slower computer, or placing such load on the system that the browser runs slowly. In addition, the pared-down, backgrounded environments that bots often employ to load web pages undetected do not behave identically to fully rendered browsers. Because of these factors, measurements of the system look different for bots than for human users. This is sometimes known as side-channel analysis.

Session-based signals tend to be highly reliable indicators of fraud. However, in order to evade detection, bot programmers work to make their programs mimic human behavior as closely as possible. In most cases, the information collected from a single impression is not sufficient to classify it as bot-generated. Consequently, it is also critical to analyze historical patterns of activity, where deterministic, programmed behavior is much harder to hide.
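
One simple way to see why several impressions are usually needed is to accumulate evidence with Bayes' rule. The sketch below is an illustration only, not the method used by any particular vendor. The prior and the likelihood ratios are invented numbers, and treating impressions as conditionally independent is a simplifying assumption.

```python
import math


def update_bot_probability(prior: float, likelihood_ratios) -> float:
    """Combine a prior with per-impression likelihood ratios via Bayes' rule.

    Each ratio is P(observation | bot) / P(observation | human). Treating the
    impressions as conditionally independent is a simplifying assumption.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Invented numbers: a 10% prior and impressions that are each three times
# more likely to come from a bot than from a human.
one_impression = update_bot_probability(0.1, [3.0])
five_impressions = update_bot_probability(0.1, [3.0] * 5)
print(round(one_impression, 3), round(five_impressions, 3))
```

With these assumed inputs a single mildly suspicious impression leaves the probability at one in four, while five of them push it above 96%. This matches the observation that the probability can approach certainty quickly once enough data is available.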

4.1.2 Historical Information

More comprehensive than the session-based signals described above are the patterns of activity observed over time periods of a few minutes to a few days. One approach is to compile statistics of the many features described above, broken down along many different dimensions. Each impression is looked up against this database of historical evidence and a verdict is reached by evaluating the collection of data points. For example, one may compute the historical geographic distribution of audience by web page, as well as by various exchanges and networks, as well as by those exchanges and networks on that page. These different slices can be combined, along with the knowledge of typical patterns of usage, to strengthen or refute a claim of fraud.

In addition, the browsing history of individual users can reveal signatures of fraudulent behavior. Bots within a particular botnet show behavior that is highly correlated. There are a number of reasons for browsing behavior to be correlated on the Internet: users have similar interests to each other, websites have organizational or collaborative connections with each other through hyperlinks, and bots make their way around a list of websites they are being paid to visit. Each of these types of correlation may be distinguished from one another. A working paper describing this technology was recently presented at the 2014 Winter Conference on Business Intelligence in Utah [7]. A major benefit of this sort of approach is that it directly characterizes the business model of bot-driven impression fraud (i.e., generating unreasonably high traffic to a small community of websites using a distinct subset of users) and is therefore more robust to modification to the operation of the botnet.

4.2 Challenges

Depending on the historical measures used, the amount of new data required to identify a bot may range from a few minutes to a few days.
For techniques requiring longer time ranges, bots may appear and inflict significant financial damage before they are detected and neutralized. Therefore it is critical to devise schemes which can successfully operate with short histories. Collecting impression data at high rates improves the effectiveness of the fraud detection solution and can reduce the length of time needed to make a determination.

Additionally, many of the signals (of either type) which can identify fraudulent activity are only circumstantial evidence. For example, monitoring the bounce rate of users or the browser versions being used can be indicative of fraud, but is not robust to changes in bot behavior. It is a relatively simple matter to alter bot instructions to circumvent detection techniques such as these, so it is imperative to utilize methods which cannot be defeated without a change to the underlying business model of advertising fraud. Identifying correlated browsing patterns, for example, is one approach which is robust in this way.

There is no consensus yet on whether such dishonest practices as employing human farms to drive larger numbers of impressions, bid URL spoofing, or serving video ads into display placements should also constitute fraud. While these advertisements are unlikely to be effective, they can still be seen by humans and, therefore, cannot be classified as non-human traffic.
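
The idea of looking for correlated browsing patterns can be illustrated with a toy example. This is not the technique of the working paper [7]. It swaps in a plain Jaccard similarity between the sets of sites each user has visited, and the user names, site names, and 0.5 threshold are invented.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of sites two users have in common, out of all sites either visited."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlated_pairs(visits: dict, threshold: float):
    """Return (user, user, similarity) for each pair at or above the threshold."""
    pairs = []
    for u, v in combinations(sorted(visits), 2):
        score = jaccard(visits[u], visits[v])
        if score >= threshold:
            pairs.append((u, v, score))
    return pairs


# Invented browsing histories: three users cycle through nearly the same
# short list of sites, while the fourth browses independently.
VISITS = {
    "user_a": {"s1.example", "s2.example", "s3.example", "s4.example"},
    "user_b": {"s1.example", "s2.example", "s3.example", "s4.example"},
    "user_c": {"s1.example", "s2.example", "s3.example", "news.example"},
    "user_d": {"news.example", "shop.example"},
}
PAIRS = correlated_pairs(VISITS, threshold=0.5)
for u, v, score in PAIRS:
    print(u, v, round(score, 2))
```

High overlap alone does not prove fraud, since shared interests and linked websites also produce correlated histories, as noted in section 4.1.2. A score like this would only be a starting point for telling those cases apart, and comparing every pair of users does not scale to real traffic volumes without further engineering.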

4.3 Validation

It is extremely difficult to test fraud detection in a way that is representative and comprehensive. Some have elected to purchase traffic to a website and observe the resulting levels of fraud reported. However, this is highly uncertain: often the traffic provided will be human, so a lack of detection is not necessarily indicative of a problem. On the other hand, a sure way to generate bot traffic is to infect a honeypot with malware committing impression fraud, and then verify that traffic from this machine is flagged. While its identity as bot traffic is indisputable, it represents just a single variant of some piece of malware, and cannot be assumed to be representative of the performance of a detection scheme as a whole.

5 Conclusion

This paper presents the authors' view on the current state-of-the-art methods of combating non-human traffic in online advertising. The nature of fraud online is dynamic, with new types of fraud and new masking techniques appearing all the time. Solutions such as those capturing macroscopic patterns reflective of the bot fraud business model are quite robust to changes in the technology controlling the bots, but vigilance is imperative. We must be ready for a long fight with well-equipped, highly incentivised, and creative opponents, whom we can only defeat through innovations in our infrastructure, business models, and data science.

References

[1] Incapsula. Report: Bot traffic is up to 61.5% of all website traffic. Dec. 2013. URL: http://www.incapsula.com/blog/bot-traffic-report-2013.html.
[2] Google Inc. Googlebot. 2014. URL: https://support.google.com/webmasters/answer/182072?hl=en.
[3] BF Bot Manager. Bf Bot Manager - Probably best Betfair bot system in the world. 2014. URL: http://www.bfbotmanager.com/.
[4] Phyllis Plitch. Are Bots Legal? Comparison-shopping sites say they make the Web manageable. Critics say they trespass. In: The Wall Street Journal (Sept. 2002). URL: http://online.wsj.com/news/articles/sb1031785433448138595.
[5] Nick Bilton. Friends, and Influence, for Sale Online. In: The New York Times (Apr. 2014). URL: http://bits.blogs.nytimes.com/2014/04/20/friends-and-influence-for-sale-online/.
[6] Panos Ipeirotis. Uncovering an advertising fraud scheme. Or The Internet is for porn. Mar. 2011. URL: http://www.behind-the-enemy-lines.com/2011/03/uncovering-advertising-fraud-scheme.html.
[7] Jason L. Shaw and Kiril Tsemekhman. Fraud in online advertising: a case study in systematic error in business intelligence. In: 2014 Winter Conference on Business Intelligence (Feb. 2014).