Data Mining in Pharmaceutical Marketing and Sales Analysis. Andrew Chabak Rembrandt Group



Similar documents
Statistics and Data Mining

Principles of Data Mining by Hand&Mannila&Smyth

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

A Review of Data Mining Techniques

Statistical Models in Data Mining

Data Mining + Business Intelligence. Integration, Design and Implementation

The KDD Process for Extracting Useful Knowledge from Volumes of Data

Database Marketing, Business Intelligence and Knowledge Discovery

DISCRIMINANT FUNCTION ANALYSIS (DA)

Master of Science in Health Information Technology Degree Curriculum

Machine Learning with MATLAB David Willingham Application Engineer

How To Understand The Theory Of Probability

Additional sources Compilation of sources:

Simple Predictive Analytics Curtis Seare

DATA MINING TECHNIQUES AND APPLICATIONS

Why do statisticians "hate" us?

Big Data: Rethinking Text Visualization

Qualitative vs Quantitative research & Multilevel methods

Statistics for BIG data

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Regression Modeling Strategies

ANALYTICS CENTER LEARNING PROGRAM

Data Warehousing and Data Mining for improvement of Customs Administration in India. Lessons learnt overseas for implementation in India

Machine Learning and Data Mining. Fundamentals, robotics, recognition

UNIVERSITY OF NAIROBI

Internal Consultant Training

PREDICTIVE ANALYTICS: PROVIDING NOVEL APPROACHES TO ENHANCE OUTCOMES RESEARCH LEVERAGING BIG AND COMPLEX DATA

Introduction. A. Bellaachia Page: 1

Data Mining and Soft Computing. Francisco Herrera

Data Mining Applications in Higher Education

MSCA Introduction to Statistical Concepts

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

The University of Jordan

Faculty of Science School of Mathematics and Statistics

Data Mining: An Overview. David Madigan

How To Use Neural Networks In Data Mining

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Applying Statistics Recommended by Regulatory Documents

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Business Analytics and Credit Scoring

Handling attrition and non-response in longitudinal data

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data analysis process

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

The Data Mining Process

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Perspectives on Data Mining

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

TDWI Best Practice BI & DW Predictive Analytics & Data Mining

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Learning outcomes. Knowledge and understanding. Competence and skills

Data Mining: An Introduction

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Introduction to Data Mining

Social Media Mining. Data Mining Essentials

Employer Health Insurance Premium Prediction Elliott Lui

Information Management course

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

Knowledge Discovery from patents using KMX Text Analytics

Regression III: Advanced Methods

Information Visualization WS 2013/14 11 Visual Analytics

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Least Squares Estimation

An Overview of Database management System, Data warehousing and Data Mining

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Statistics Graduate Courses

Customer Classification And Prediction Based On Data Mining Technique

Data Mining for Successful Healthcare Organizations

INDIAN STATISTICAL INSTITUTE announces Training Program on Statistical Techniques for Data Mining & Business Analytics

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Data Mining Part 5. Prediction

Richard M. Heiberger Professor Emeritus of Statistics Temple University January 2016

II. DISTRIBUTIONS distribution normal distribution. standard scores

Data Mining and Machine Learning in Bioinformatics

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Predictive Modeling Techniques in Insurance

An Overview of Knowledge Discovery Database and Data mining Techniques

Cleaned Data. Recommendations

TABLE OF CONTENTS. About Chi Squares What is a CHI SQUARE? Chi Squares Hypothesis Testing with Chi Squares... 2

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Chapter ML:XI. XI. Cluster Analysis

Publication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore

Mathematics within the Psychology Curriculum

Imputing Values to Missing Data

Copyright 2009 SAS Institute Inc. All rights reserved. Success With Business Analytics in the New Pharmaceutical Commercial Model.

Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds

Transcription:

Data Mining in Pharmaceutical Marketing and Sales Analysis Andrew Chabak Rembrandt Group 1

Contents What is Data Mining? Data Mining vs. Statistics: what is the difference? Why Data Mining is important tool in pharmaceutical marketing research and sales analysis? Case Study 2

What is the Data Mining? The magic phrase to put in every funding proposal you write to NSF, DARPA, NASA, etc Data Mining is a process of torturing the data until they confess The magic phrase you use to sell your.. - database software - statistical analysis software - parallel computing hardware - consulting services 3

Data Mining Data Mining is a cutting edge technology to analyze diverse, multidisciplinary and multidimensional complex data is defined as the non-trivial iterative process of extracting implicit, previously unknown and potentially useful information from your data Data mining could identify relationships in your multidimensional and heterogeneous data that cannot be identified in any other way Successful application of state-of-the-art data mining technology to marketing, sales, and outcomes research problems (not to mention drug discovery) is indicative of analytic maturity and the success of a pharmaceutical company 4

Data Mining and Related Fields Statistics Database Machine Learning Data Mining Visualization Is Data Mining extension of Statistics? 5

Statistics vs. Data Mining: Concepts Feature Statistics Data Mining Type of Problem Well structured Unstructured / Semi-structured Inference Role Objective of the Analysis and Data Collection Explicit inference plays great role in any analysis First objective formulation, and then - data collection No explicit inference Data rarely collected for objective of the analysis/modeling Size of data set Data set is small and hopefully homogeneous Data set is large and data set is heterogeneous Paradigm/Approach Theory-based (deductive) Synergy of theory-based and heuristicbased approaches (inductive) Signal-to-Noise Ratio STNR > 3 0 < STNR <= 3 Type of Analysis Confirmative Explorative Number of variables Small Large 6

Statistics vs. Data Mining: Regression Modeling Feature Statistics Data Mining Number of inputs Small Large Type of inputs Multicollinearity Distributional assumptions, homoscedasticity, outliers, missing values Type of model Interval scaled and categorical with small number of categories (percentage of categorical variables is small) Wide range of degree of multicollinearity with intolerance to multicollinearity Intolerance to distrubitional assumption violation, homoscedasticity, Outliers/leverage points, missing values Linear / Non-linear / Parametric / Non- Parametric in low dimensional X-space (intolerance to uncharacterizable nonlinearities) Any mixture of interval scaled, categorical, and text variables Severe multicollinearity is always there, tolerance to multicollinearity Tolerance to distributional assumption violation, outliers/leverage points, and missing values Non-linear and non-parametric in high dimensional X-space with tolerance to uncharacterizable non-linearities 7

What is an unstructured problem? Definition Well-structured Business Problem Can be described with a high degree of completeness Can be solved with a high degree of certainty Experts usually agree on the best method and best solution Can be easily and uniquely translated into quantitative counterpart Unstructured Business Problem Cannot be described with a high degree of completeness Cannot be resolved with a high degree of certainty Experts often disagree about the best method and best solution Cannot be easily and uniquely translated into quantitative counterpart Goal Find the best solution Find reasonable solution Complexity Ranges from very simple to complex Ranges from complex to very complex Example Project: Sample size calculation Key business question: what is the physician sample size (PCP vs. OBGYN) to detect five script difference of Product A. sales? How to translate this business question into quantitative counterpart? No data No variables Project: Customer Feedback Study Key business question: Is there any relationship between customer perception and performance? How to translate this business question into quantitative counterpart? Data: 800 observations/interactions and 400 variables/attributes Variables are differently scaled 8

What are differences between Data Mining and Statistics? Statistical analysis is designed to deal with well structured problems: Results are software and researcher independent Inference reflects statistical hypothesis testing Data mining is designed to deal with unstructured problems Results are software and researcher dependent Inference reflects computational properties of data mining algorithm at hand 9

Is Data Mining extension of Statistics? Data Mining and Statistics: mutual fertilization with convergence Statistical Data Mining (Graduate course, George Mason University) Statistical Data Mining and Knowledge Discovery (Hardcover) by Hamparsum Bozdogan (Editor) An overview of Bayesian and frequentist issues that arise in multivariate statistical modeling involving data mining Data Mining with Stepwise Regression (Dean Foster, Wharton School) use interactions to capture non-linearities use Bonferroni adjustment to pick variables to include use the sandwich estimator to get robust standard errors 10

When data mining technology is appropriate? Data mining technology is appropriate if: The business problem is unstructured Accurate prediction is more important than the explanation The data include the mixture of interval, nominal, ordinal, count, and text variables, and the role and the number of non-numeric variables are essential Among those variables there are a lot of irrelevant and redundant attributes The relationship among variables could be non-linear with uncharacterizable nonlinearities The data are highly heterogeneous with a large percentage of outliers, leverage points, and missing values The sample size is relatively large Important marketing, sales, and outcomes research studies have the majority of these features 11

Accurate prediction is more important than the explanation 12

Case Study: Effectiveness Evaluation of Vaccine Sales Force New lunched vaccine (10 months in marketplace) Sales force structure: two sales team Team_1 Team_2 Some locations are visited only by Team_1 (1- up promotion), others by both teams (2-up promotion) Promotion is on doctors/hcp level, but sales is on location level Business question: What is the effectiveness of 2-up promotion? 13

Dependent Variables and Study Design Dependent variables (criteria to judge): Total dosage purchased (shipped or ordered?) Total dosage purchased per promotion dollar Probability of making a purchase Time to the first purchase Frequency of purchase, etc. Study Design: Test (2-up promotion locations) Control (1-up promotion locations) Consider Vaccine sales as two criteria problem: Estimate outcome for Test and Control groups, taking into account difference in sales difference in promotion cost Use pre-period data to match Test and Control groups and postperiod data to compare sales and promotion cost 14

Independent Variables 47 input variables Demographics of a location Geography, specialty, potentials, potential decile, average age, reimbursement, believes, etc. Promotion activities Number of direct interactions with HCP by Team_1 Number of direct interactions with HCP by Team_2 Percentage of direct interaction with decision maker by Team_1 Percentage of direct interaction with decision maker by Team_2 Number of phone interactions with HCP by Team_1 Number of phone interactions with HCP by Team_2, etc. 15

Variables Distribution Private Dosage Potential Number of Sales Calls Public dosage Potential Vaccine Dosage Sales 16

Methodology Form Test - Control groups, using only pre-period data and propensity score methodology with greedy one-toone matching technique on propensity score Develop models for the post-period data for total vaccine sales, controlling for location demographics variables promotion variables in pre-period sales in pre-period Estimate the difference in sales for Test and Control groups, taking into account promotion cost 17

Propensity Score Propensity score is the predicted probability of receiving the treatment (probability of belonging to a test group) is a function of several differently scaled covariates Propensity_Score = f (location demographics variables, promotion variables, sales variables), 0 < f < 1 where f is a non-parametric non-linear multivariate function A sample matched on propensity score will be similar across all covariates used to calculate propensity score 18

Propensity score with SAS EM Neural Net was the best modeling paradigm 19

Finding and Business Implication Team / Group Mean of Sales Call Test 5.29 Control 2.76 There is no algorithmically significant difference in sales between Test and Control groups, but promotion cost for Control group is two times lower than for Test group. In other words, 2-up structure does not produce desired/expected outcome for Vaccine performance 20

Phone call / Sales call response curve for Vaccine sales, constructed by TreeNet Number of Purchases has a strong diminishing returns effect when the number of sale calls becomes greater than thirty five and the number of Phone calls becomes greater than 2 Number of Purchases Number of Purchases Number of Phone calls Number of Sales calls 21

Phone call and Sales call response surface for Vaccine sales,, constructed by TreeNet Sales call is the most effective when a location gets two Phone calls Number of Purchase has a strong diminishing returns effect when the number of Sales calls becomes greater than thirty five and the number of Phone calls becomes greater than 2 22

Reference David J. Hand, Data Mining: Statistics and More? The American Statistician, May 1998, Vol. 52 No. 2 http://www.amstat.org/publications/tas/hand.pdf Friedman, J.H. 1997. Data Mining and Statistics. What s connection? Proceedings of the 29 th Symposium on the Interface: Computing Science and Statistics, May 1997, Houston, Texas Padhraic Smyth (2000), An Introduction to Data Mining, Elumetric.com Inc Doug Wielenga (2007), Identifying and Overcoming Common Data Mining Mistakes, SAS Global Forum Paper 073-2007 23