Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

Open Data Partners and AdReady, April 2012

Executive Summary

AdReady is working to develop and deploy sophisticated statistical models for large-scale targeted advertising. The system robustly and optimally arbitrates between thousands of advertisements in the time that it takes to load a webpage. Several criteria were considered when selecting the statistical tool used to develop models and score events in production:

Performance. Thousands of events are scored per second, so scoring must be fast and scalable. Each event must be scored and processed in tens of milliseconds, because every missed bid represents lost revenue.

Portability. Models are self-describing and encapsulated, so they can be moved easily from development to production environments and ported easily between different systems and applications.

Open Source. Open source software is particularly well suited to big data applications and cloud-based deployments, since the software can be installed as required without worrying about licensing costs. With open source software, AdReady was able to keep its costs down and still scale out as required. Having the source code available allows AdReady to insert custom code directly into the scoring workflow when required, rather than running it outside the workflow, where any extra steps may cost precious milliseconds.

AdReady selected the open source Augustus system both to build the statistical models they required and to deploy those models into their ad delivery system. Models are being deployed in Amazon's elastic infrastructure, and models are moved between the statistical modeling environment and the production environment using Predictive Model Markup Language (PMML), the most widely deployed standard for describing statistical and data mining models.

Online Targeted Advertising

AdReady is creating a fast, robust, and scalable ad-bidding system that allows custom campaigns to bid for web-page advertising space. The system uses sophisticated predictive models to identify bids that have a high probability of winning, at a good price, on sessions likely to lead to conversion. AdReady's deployment environment is based on the following principles:

Quickly deploy models. Models are built in a statistical modeling environment and exported as PMML. PMML (the Predictive Model Markup Language) is an XML-based standard for statistical modeling that can be visually inspected, tested in different scoring engines, internally validated with embedded test data, and manipulated with simple scripts. PMML models are then imported into a scoring engine that is integrated into the operational environment. Updating a model is as simple as reading a PMML file. No coding is required.

Scale using Amazon's elastic infrastructure. With Amazon's elastic infrastructure, all that is required to scale is to add EC2 instances with embedded scoring engines. With this approach, AdReady will be able to scale to 11,000 events per second.

Don't touch disk. By linking Apache web servers and Augustus scoring engines using Amazon's distributed environments, events can be scored with PMML statistical models in tens of milliseconds.

Leverage open source software whenever possible. By using open source software, AdReady is integrating statistical modeling into a large-scale ad delivery system without incurring licensing fees. In addition to Augustus, AdReady also uses Apache and Django, a Python MVC web framework that can be tightly integrated with Augustus.

Ad-bidding systems combine three extremes of statistical processing: large, fine-grained models; robust, high-volume throughput; and heterogeneous data. To solve all three problems, AdReady incorporated the Augustus statistical toolkit into their model-production and scoring workflows using Amazon's Web Services.
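The idea that updating a model amounts to reading a PMML file can be illustrated with a minimal sketch. It does not use Augustus or reproduce its API. It parses a small, made-up PMML regression document with Python's standard library and scores one event. The field names, coefficients, and the helper functions load_model and score are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: a tiny PMML regression document written for this example.
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header description="illustrative model, not an AdReady model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="recency" optype="continuous" dataType="double"/>
    <DataField name="frequency" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="recency"/>
      <MiningField name="frequency"/>
      <MiningField name="score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="recency" coefficient="-0.25"/>
      <NumericPredictor name="frequency" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# PMML elements live in a versioned XML namespace.
NS = {"p": "http://www.dmg.org/PMML-4_1"}


def load_model(pmml_text):
    """Read the intercept and per-field coefficients from a PMML regression."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, event):
    """Apply the linear model to one event (a dict of field name to value)."""
    intercept, coefficients = model
    return intercept + sum(c * event[name] for name, c in coefficients.items())


model = load_model(PMML_TEXT)
result = score(model, {"recency": 2.0, "frequency": 1.5})
print(result)  # 0.5 - 0.25 * 2.0 + 2.0 * 1.5 = 3.0
```

In this sketch, deploying a new model means calling load_model on a different PMML document; the scoring code itself does not change.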

Predictive Models

AdReady is employing segmented models in order to obtain more accuracy over highly heterogeneous data. A segmented model is a collection of complete models, one or more of which are selected based upon the input event. The entire collection of segmented models can be encapsulated as a single PMML file and uploaded into each elastic scoring instance. The model file runs in memory, guaranteeing that (1) new instances always have the latest model and (2) models can be updated independently of machine images.

Throughput

To deploy the model, Augustus was integrated into the online system by invoking it as a Python library from the Django framework. This allows AdReady to bypass the normal data pipeline and connect the scoring engine directly into their web framework. Augustus is written in Python and NumPy to combine a Python-based coding environment with highly scalable numeric processing. Since the data are processed without reading from or writing to disk, the whole system will be able to respond to web requests within 100 ms. If more capacity is required, copies of the entire stack can be launched as virtual machines (scale out instead of scale up).

Data

The production system must be robust against missing or corrupt data. Data come from cookies and session metadata and are enriched from backend content sources. PMML provides a framework for making decisions about invalid data at runtime, and the Python interface simplifies the handling of invalid data because values of heterogeneous types can be passed through it. Unexpected input results in fallback procedures, rather than uncaught exceptions.

Development Cycle

The entire system is being developed on a tight schedule of four months from start to finish. This pace is possible because of transparency at all levels: PMML models are human-readable text files, Augustus is open source, and the Python interface has a rapid development cycle. These conveniences for the programmers do not limit scalability, because all large-scale numeric calculations are performed in NumPy, which is based on compiled C and Fortran libraries. Model consistency is assisted by the internal validation features of PMML: sample data and expected results are embedded in each model file and can be tested automatically. This allows Open Data and AdReady to pass working examples, or test cases in need of attention or analysis, back and forth within the models.

This project is a collaboration between AdReady and Open Data Group.
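The segment selection, fallback handling, and embedded test cases described in the sections above can be sketched in a few lines of plain Python. This is not Augustus code or PMML syntax. The segment names, fields, coefficients, fallback value, and the functions clean, score, and verify are all hypothetical and chosen only to show the shape of the approach.

```python
import math

# Each segment pairs a predicate on the event with a complete model.
# Segment names, fields, and coefficients are hypothetical.
SEGMENTS = [
    {"name": "mobile",
     "applies": lambda e: e["device"] == "mobile",
     "model": lambda e: 0.02 + 0.001 * e["visits"]},
    {"name": "desktop",
     "applies": lambda e: e["device"] == "desktop",
     "model": lambda e: 0.01 + 0.002 * e["visits"]},
]

FALLBACK_SCORE = 0.0  # assumed default when no segment can score the event


def clean(raw):
    """Return a validated event, or None if a field is missing or corrupt."""
    try:
        visits = float(raw["visits"])
        device = str(raw["device"]).lower()
    except (KeyError, TypeError, ValueError):
        return None
    if math.isnan(visits) or visits < 0:
        return None
    return {"device": device, "visits": visits}


def score(raw):
    """Score one raw event; bad or unmatched input takes the fallback path."""
    event = clean(raw)
    if event is None:
        return FALLBACK_SCORE, "fallback"
    for segment in SEGMENTS:
        if segment["applies"](event):
            return segment["model"](event), segment["name"]
    return FALLBACK_SCORE, "fallback"


# Test cases shipped alongside the model: sample inputs with expected scores.
VERIFICATION = [
    ({"device": "mobile", "visits": 10}, 0.03),
    ({"device": "desktop", "visits": "5"}, 0.02),  # numeric string is accepted
    ({"device": "tablet", "visits": 1}, 0.0),      # no matching segment
    ({"visits": "n/a"}, 0.0),                      # corrupt and missing fields
]


def verify(cases, tolerance=1e-9):
    """Check that every embedded case reproduces its expected score."""
    return all(abs(score(raw)[0] - expected) <= tolerance
               for raw, expected in cases)


print(verify(VERIFICATION))
```

Keeping the verification cases next to the model means a scoring instance can run verify right after loading a model and refuse to serve one that fails its own examples.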

About

Open Data

Open Data Group specializes in building predictive models over big data and is one of the pioneers in using technologies such as Hadoop, NoSQL databases, and elastic Infrastructure as a Service so that companies can build predictive models efficiently over all of their data. Open Data Group provides outsourced analytical services, management consulting services, analytic staffing, and expert witnesses broadly related to data and analytics. It has been building predictive models over big data for over ten years and has introduced a variety of innovative technologies related to predictive modeling and analytic architectures.

AdReady

AdReady delivers the power and sophistication of multiple best-in-class enterprise software solutions for digital display advertising in an elegant, cost-effective, yet highly powerful next-generation platform. The AdReady technology platform provides:

Highly scalable campaign creation, testing and iteration
Creative production, iteration and testing automation
Integrated access to micro-targeted audiences through publishers, exchanges, networks and data providers
Geo, Demo, Behavioral and Re-Targeting capabilities
Proven optimization algorithms

Augustus

Augustus is an Apache 2.0-licensed open source system for building and scoring statistical models, designed to work with data sets that are too large to fit into memory. More information is available at http://augustus.googlecode.com.

PMML

Predictive Model Markup Language (PMML) is an XML markup language for describing statistical and data mining models. It is the leading standard for such models and is supported by over 20 vendors and organizations. More information is available at http://dmg.org.