Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising
Open Data Partners and AdReady
April 2012
Executive Summary

AdReady is working to develop and deploy sophisticated statistical models for large-scale targeted advertising. The system robustly and optimally arbitrates between thousands of advertisements in the time that it takes to load a webpage. Several criteria were considered when selecting the statistical tool used to develop models and score events in production:

Performance. Thousands of events are scored per second. Scoring must be fast and scalable: events must be scored and processed in tens of milliseconds, and every missed bid represents lost revenue.

Portability. Models are self-describing and encapsulated, so they can be moved easily from development to production environments and ported easily between different systems and applications.

Open Source. Open source software is particularly well suited to big data applications and cloud-based deployments, since the software can be installed as required without worrying about licensing costs. With open source software, AdReady was able to keep its costs down and still scale out as required. Having the source code available allows AdReady to insert custom code directly into the scoring workflow when required, rather than outside of it, where any extra step may cost precious milliseconds.

AdReady selected the open source Augustus system both to build the statistical models they required and to deploy those models into their ad delivery system. Models are deployed in Amazon's elastic infrastructure and are moved between the statistical modeling environment and the production environment using the Predictive Model Markup Language (PMML), the most widely deployed standard for describing statistical and data mining models.
Online Targeted Advertising

AdReady is creating a fast, robust, and scalable ad-bidding system that allows custom campaigns to bid for web-page advertising space. The system uses sophisticated predictive models to identify bids that have a high probability of winning, at a good price, on sessions likely to lead to conversion. AdReady's deployment environment is based on the following principles:

Quickly deploy models. Models are built in a statistical modeling environment and exported as PMML. PMML (the Predictive Model Markup Language) is an XML-based standard for statistical modeling that can be visually inspected, tested in different scoring engines, internally validated with embedded test data, and manipulated with simple scripts. PMML models are then imported into a scoring engine that is integrated into the operational environment. Updating a model is as simple as reading a PMML file. No coding is required.

Scale using Amazon's elastic infrastructure. With Amazon's elastic infrastructure, all that is required to scale is to add EC2 instances with embedded scoring engines. With this approach, AdReady will be able to scale to 11,000 events per second.

Don't touch disk. By linking Apache web servers and Augustus scoring engines using Amazon's distributed environments, events can be scored with PMML statistical models in tens of milliseconds.

Leverage open source software whenever possible. By using open source software, AdReady is integrating statistical modeling into a large-scale ad delivery system without incurring licensing fees. In addition to Augustus, AdReady also uses Apache and Django, a Python MVC web framework that can be tightly integrated with Augustus.

Ad-bidding systems combine three extremes of statistical processing: large, fine-grained models; robust, high-volume throughput; and heterogeneous data. To solve all three problems, AdReady incorporated the Augustus statistical toolkit into their model-production and scoring workflows using Amazon's Web Services.
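As an illustration of how a PMML file can be inspected with a simple script, the following is a minimal sketch that uses only Python's standard library. It does not use Augustus or show its API. The PMML fragment, the field names (hour, site_category, clicked), and the summarize helper are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML fragment for illustration only.
# It is not an AdReady model and the field names are invented.
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy click model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="hour" optype="continuous" dataType="integer"/>
    <DataField name="site_category" optype="categorical" dataType="string"/>
    <DataField name="clicked" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="toy_tree" functionName="classification">
    <MiningSchema>
      <MiningField name="hour"/>
      <MiningField name="site_category"/>
      <MiningField name="clicked" usageType="predicted"/>
    </MiningSchema>
    <Node score="no"><True/></Node>
  </TreeModel>
</PMML>"""

# PMML elements live in a versioned XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.split("}", 1)[1] if "}" in tag else tag


def summarize(pmml_text):
    """Return the PMML version, declared data fields, and model elements."""
    root = ET.fromstring(pmml_text)
    fields = [
        (field.get("name"), field.get("dataType"))
        for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
    ]
    # Top-level model elements have names ending in "Model" (TreeModel, ...).
    models = [
        local_name(child.tag)
        for child in root
        if local_name(child.tag).endswith("Model")
    ]
    return {"version": root.get("version"), "fields": fields, "models": models}


if __name__ == "__main__":
    print(summarize(PMML_TEXT))
```

Because the model is plain XML, the same kind of script could be used to check a newly exported file before it is handed to a scoring engine.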
Predictive Models

AdReady is employing segmented models in order to obtain more accuracy over highly heterogeneous data. A segmented model is a collection of complete models, one or more of which are selected based upon the input event. The entire collection of segmented models can be encapsulated as a single PMML file and uploaded into each elastic scoring instance. The model file runs in memory, guaranteeing that 1) new instances always have the latest model and 2) models can be updated independently of machine images.

Throughput

To deploy the model, Augustus was integrated into the online system by invoking it as a Python library from the Django framework. This allows AdReady to bypass the normal data pipeline and connect the scoring engine directly into their web framework. Augustus is written in Python and NumPy to combine a Python-based coding environment with highly scalable numeric processing. Since the data are processed without reading from or writing to disk, the whole system will be able to respond to web requests within 100 ms. If more capacity is required, copies of the entire stack can be launched as virtual machines (scaling out instead of scaling up).

Data

The production system must be robust against missing or corrupt data. Data comes from cookies and session metadata, and is enriched from backend content sources. PMML provides a framework for making decisions about invalid data at runtime, and the Python interface simplifies the handling of invalid data by passing values with inhomogeneous types. Unexpected input results in fallback procedures, rather than uncaught exceptions.

Development Cycle

The entire system is being developed on a tight schedule of four months from start to finish. This pace is possible because of transparency at all levels: PMML models are human-readable text files, Augustus is open source, and the Python interface has a rapid development cycle.
These conveniences for the programmers do not limit scalability because all large-scale numeric calculations are performed in NumPy, which is based on compiled C and Fortran libraries. Model consistency is assisted by the internal validation features of PMML: sample data and expected results are embedded in each model file and can be tested automatically. This allows Open Data Group and AdReady to pass working examples, or test cases in need of attention or analysis, back and forth within the models themselves.

This project is a collaboration between AdReady and Open Data Group.
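The segmented scoring described under Predictive Models, combined with the fallback behavior described under Data, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Augustus implementation or AdReady's models: the segment names, predicates, constant scores, and the FALLBACK_SCORE value are all hypothetical.

```python
# Hypothetical sketch of segmented scoring with a fallback for invalid input.
# Segment names, predicates, and scores are invented for illustration.

FALLBACK_SCORE = 0.01  # assumed default used when no segment can score the event

# Each segment pairs a predicate (does this segment apply?) with a complete
# model. Here each "model" is a constant score to keep the example short.
SEGMENTS = [
    {"name": "news_evening",
     "matches": lambda e: e["site_category"] == "news" and e["hour"] >= 18,
     "score": lambda e: 0.08},
    {"name": "news_daytime",
     "matches": lambda e: e["site_category"] == "news",
     "score": lambda e: 0.04},
    {"name": "sports",
     "matches": lambda e: e["site_category"] == "sports",
     "score": lambda e: 0.05},
]


def clean(event):
    """Coerce raw fields to expected types; return None if the event is invalid."""
    try:
        hour = int(event["hour"])
    except (KeyError, TypeError, ValueError):
        return None  # missing, wrongly typed, or unparsable hour
    category = event.get("site_category")
    if not isinstance(category, str) or not 0 <= hour <= 23:
        return None
    return {"hour": hour, "site_category": category}


def score(event):
    """Return (segment_name, score); invalid input falls back, never raises."""
    cleaned = clean(event)
    if cleaned is None:
        return ("fallback", FALLBACK_SCORE)
    for segment in SEGMENTS:
        if segment["matches"](cleaned):  # first matching segment wins
            return (segment["name"], segment["score"](cleaned))
    return ("fallback", FALLBACK_SCORE)


if __name__ == "__main__":
    print(score({"hour": "20", "site_category": "news"}))   # string hour is coerced
    print(score({"hour": "bad", "site_category": "news"}))  # corrupt hour
    print(score({}))                                          # missing fields
```

This sketch selects only the first matching segment. The text notes that one or more models may be selected for an event, so a fuller version would collect every match and combine the results.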
About Open Data

Open Data Group specializes in building predictive models over big data and is one of the pioneers in using technologies such as Hadoop, NoSQL databases, and elastic Infrastructure as a Service so that companies can build predictive models efficiently over all of their data. Open Data Group provides outsourced analytical services, management consulting services, analytic staffing, and expert witnesses broadly related to data and analytics. It has been building predictive models over big data for over ten years and has introduced a variety of innovative technologies related to predictive modeling and analytic architectures.

AdReady

AdReady delivers the power and sophistication of multiple best-in-class enterprise software solutions for digital display advertising in an elegant, cost-effective, yet highly powerful next-generation platform. The AdReady technology platform provides:

Highly scalable campaign creation, testing and iteration
Creative production, iteration and testing automation
Integrated access to micro-targeted audiences through publishers, exchanges, networks and data providers
Geo, Demo, Behavioral and Re-Targeting capabilities
Proven optimization algorithms

Augustus

Augustus is an Apache 2.0-licensed open source system for building and scoring statistical models, designed to work with data sets that are too large to fit into memory. More information is available at http://augustus.googlecode.com.

PMML

Predictive Model Markup Language (PMML) is an XML markup language for describing statistical and data mining models. It is the leading standard for such models and is supported by over 20 vendors and organizations. More information is available at http://dmg.org.