Stata for micro targeting using C++ and ODBC



Similar documents
In-Database Analytics

How To Make A Credit Risk Model For A Bank Account

Accessing Your Database with JMP 10 JMP Discovery Conference 2012 Brian Corcoran SAS Institute

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Marketing Strategies for Retail Customers Based on Predictive Behavior Models

4D and SQL Server: Powerful Flexibility

SCORE An Overview. State of Colorado Registration and Election Management

XMailer Reference Guide

El poder de la Programación de Excel y Visual Basic User Review --->>> Enter Here More Details => VISIT HERE

Y Local 4GB SQL Server instance. Y Via generic ODBC connection. Adding full support for Aster and Excel 1st quarter next year.

Understanding the Benefits of IBM SPSS Statistics Server

KnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE

SEIZE THE DATA SEIZE THE DATA. 2015

Brian LaGoe, Systems Administrator Benjamin Jellema, Systems Administrator Eastern Michigan University

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

KnowledgeSEEKER Marketing Edition

Knowledge Discovery and Data Mining

California Democratic Party INTRODUCTION TO MOE (Mobilize, Organize, Elect)

Producing Listings and Reports Using SAS and Crystal Reports Krishna (Balakrishna) Dandamudi, PharmaNet - SPS, Kennett Square, PA

irods and Metadata survey Version 0.1 Date March Abhijeet Kodgire 25th

Contactegration for The Raiser s Edge

1 File Processing Systems

In Memory Accelerator for MongoDB

How to Connect to CDL SQL Server Database via Internet

Journée Thématique Big Data 13/03/2015

TU04. Best practices for implementing a BI strategy with SAS Mike Vanderlinden, COMSYS IT Partners, Portage, MI

Title. Syntax. stata.com. odbc Load, write, or view data from ODBC sources. List ODBC sources to which Stata can connect odbc list

Development at the Speed and Scale of Google. Ashish Kumar Engineering Tools

Upgrade Considerations for etrust Audit and etrust Security Command Center r8 SP2

Architecting the Future of Big Data

Customer Management TRIAD version 2.0

SQL Azure and SqlBulkCopy

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Optimizing LTO Backup Performance

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

2015 Workshops for Professors

Your Best Next Business Solution Big Data In R 24/3/2010

DSS. Diskpool and cloud storage benchmarks used in IT-DSS. Data & Storage Services. Geoffray ADDE

Knocker main application User manual

Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

Setting Up ALERE with Client/Server Data

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

White Paper. PiLab Technology Performance Test, Deep Queries. pilab.pl

INTRODUCTION: SQL SERVER ACCESS / LOGIN ACCOUNT INFO:

DiskBoss. File & Disk Manager. Version 2.0. Dec Flexense Ltd. info@flexense.com. File Integrity Monitor

SQL Server 112 Success Secrets. Copyright by Martha Clemons

Knowledge Discovery and Data Mining

A Near Real-Time Personalization for ecommerce Platform Amit Rustagi

Using IRDB in a Dot Net Project

Optional custom API wrapper. C/C++ program. M program

ISAM TO SQL MIGRATION

Integrating Apache Spark with an Enterprise Data Warehouse

Drupal 6 to Drupal 7 Migration Worksheet

Troubleshooting A Database Connection

Captivate Your Mobile Customers

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

How To Understand The Error Codes On A Crystal Reports Print Engine

Cheminformatics and Pharmacophore Modeling, Together at Last

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Using the SQL Server Linked Server Capability

METADATA DRIVEN INTEGRATED STATISTICAL DATA PROCESSING AND DISSEMINATION SYSTEM

Product Information. Sugar vs Zoho. Features Comparison

2015, André Melancia (Andy.PT) 1

32-bit and 64-bit BarTender. How to Select the Right Version for Your Needs WHITE PAPER

The SAS Solution for Enterprise Marketing Automation - Opening Session Demo

The Predictive Data Mining Revolution in Scorecards:

The Power of Predictive Analytics

Data Mining Practical Machine Learning Tools and Techniques

BI xpress Product Overview

Gladstone Health & Leisure Technical Services

Chimpegration for The Raiser s Edge

How to Configure Informix Connect and ODBC

Teradata SQL Assistant Version 13.0 (.Net) Enhancements and Differences. Mike Dempsey

Research-supported & Data-driven Methods to Pass a Levy Campaign. Justin Zemanski, Social Studies Educator, 3rd year Doc.

Archive Server for MDaemon Keep track of all your ! Save that information in a safe place and retrieve it in a snap.

DiskPulse DISK CHANGE MONITOR

Importing Data from Microsoft Access into LISTSERV Maestro

Cloud Computing. With MySQL and Pentaho Data Integration. Matt Casters Chief Data Integration at Pentaho Kettle project founder

Segmentation and Data Management

Data Mining. Nonlinear Classification

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Campus Network Design Science DMZ

Decision Trees from large Databases: SLIQ

ITG Software Engineering

Local Area Networks: Software and Support Systems

Service Desk Intelligence System Requirements

Software: Systems and. Application Software. Software and Hardware. Types of Software. Software can represent 75% or more of the total cost of an IS.

CA Workload Automation Agents for Mainframe-Hosted Implementations

SysPatrol - Server Security Monitor

How to Optimize Your Data Mining Environment

The Economist/YouGov Poll

Microsoft' Excel & Access Integration

Transcription:

Stata for micro targeting using C++ and ODBC Masahiko Aida

Page 2 The Aim of the Research Outline of presentation What is micro-targeting? Work flows Utilizing best part of three computing platform Stata for its flexibility. SQL server for handling large data. C++ for fast complex calculation.

Page 3 What is targeting? Political campaigns targeted voters to convey candidate s views or to encourage voting (GOTV). Targeting using macro level data Buy ad spots by media market (defined by county). Send canvassers to particular precincts. Nature of macro level targeting Pros: Candidate-level breakdown of votes are only reported by precinct level. Cons: Not very efficient.

Page 4 How does Micro-targeting Work? Attitudes Behaviors 1. Create a model with training data 2. Extrapolate and estimate predicted values for all registered voters. Voter file data (age, gender, party registration & vote histories) Census tract/block summary data Consumer data Predicted values from the model

Page 5 Statistical Models for Micro-Targeting Required properties for targeting models Large number of covariates. Incorporating complex interaction terms. Robustness. Need to avid over fitting. Working solutions for above requirement (2009 version) Decision tree models. Model averaging (bagging or boosting). Cross validation. Use (political) common sense.

Page 6 Workflow of Micro-Targeting Analysis & Scoring (1) Survey Data From call center ODBC connection Voter files are stored (2) Recode, Export recoding syntax as C++ (4) Score propensity in SQL server clean & Analyze code Load plug-in (3) Modeling & validation Data mining software Export model Dynamic link Library in C++ Direct mail consultant Telphone Contact Canvassing as C++code Survey Sampling

Page 7 How to Make Plug-ins stplugin.c Works as a bridge Between your C++ Code and Stata scoring.c Scoring function Written in C++ Compile (ex. gcc) DLL Now Stata has new function!! Wrapper.c Load/export data Within memory Of Stata to functions Written in C++ recoding.c Translate generate, replace syntax of Stata into C++

Page 8 Example Example of scoring 4 th random subset (2 nd argument) of Michigan voter file (1 st argument). capture program drop Scoring program define Scoring, rclass odbc load, dialog(complete) dsn( xxx") clear sqlshow exec("select VAR1, VAR2 FROM Table_`1 WHERE Rand = `2 ) do recoding.do gen Predicted =. plugin call Scoring VAR1 VAR2 Predicted save w:\scores\score_`1 _`2, replace end clear Scoring MI 4

Page 9 Is Plug-ins Worth Investing? It is not feasible to write scoring syntax for boosted/bagged model using SQL command or Stata do file. Time needed for coding in C++ becomes insignificant for large applications. Expected time to score AR voter file Expected time to score CA voter file Singe Decision Tree (100 nodes) 241 sec 39 min Boosted Tree 30 trees (100 nodes) 316 sec 51 min Boosted Tree 300 trees (100 nodes) 914 sec 147 min 21 predictors Bagged Tree 300 trees (248 nodes) 62 min 596 min N= 67,177

Page 10 Putting It in Perspective Loading Data via ODBC: 49% of total process Under windows 2003 server platform, ODBC seems fast enough. DB should be indexed by the query variable. SQL server can act faster if more memory/cpus are dedicated. Scoring Data (Bagging): 48% of total process This of course depends on complexity of the model. Recoding: 2% of total process Recoding data that are already in Stata s memory is fast. Saving/exporting data: 1% of total time Saving Stata file in fast disk array (RAID 10) requires negligible time.

Page 11 Summary Stata s flexibility and various functions make it ideal platform for data analysis and data management. However, relational database can handle large data with much ease. Specialized software (ex. data mining) can run certain analysis more efficiently (decision tree & boosting), and many of them can spit C++ codes. Stata can speak both ODBC and C++.