On Generating High InfoQ with Bayesian Networks Ron S. Kenett Research Professor, University of Turin Chairman and CEO, KPA Ltd. ron@kpa-group.com Based on joint work with Galit Shmueli, Silvia Salini and Federica Cugnata 1
Applied statistics is about meeting the challenge of solving real world problems with mathematical tools and statistical thinking 2
Mathematical Tools Statistical Thinking 3
Mathematical Tools Statistical Thinking Real World Problems 4The mathematical statistician
Mathematical Tools Statistical Thinking Real World Problems 5The proactive statistician, G. Hahn
Presentation topics Bayesian networks Information Quality Robustness of network Sensitivity of predictions information knowledge numbers data findings statistical analysis Examples from: Customer surveys Risk management of telecom systems Monitoring of bioreactors Managing healthcare of ESRD patients Kenett, Ron S., Applications of Bayesian Networks (2012). Available at SSRN: http://ssrn.com/abstract=2172713 6or http://dx.doi.org/10.2139/ssrn.2172713
Information Quality (InfoQ) The potential of a particular dataset to achieve a particular goal using a given empirical analysis method g X f U A specific analysis goal The available dataset An empirical analysis method A utility measure InfoQ(f,X,g) = U( f(x g) ) Kenett, R.S. and Shmueli, G. (2014) On Information Quality, Journal of the Royal Statistical Society, Series A 7(with discussion), Vol. 177, No. 1, pp. 3-38, 2014. http://ssrn.com/abstract=1464444.
Domain Space Knowledge Analytic Space Information Quality Bayesian networks InfoQ Robustness of network Sensitivity of predictions Goals InfoQ(f,X,g) = U(f(X g)) 1. Data resolution 2. Data structure 8 Data Quality Primary Data Analysis Quality Secondary Data - Experimental - Experimental - Observational - Observational 3. Data integration 4. Temporal relevance 5. Chronology of data and goal 6. Generalizability 7. Operationalization 8. Communication
f = Bayesian Networks Judea Pearl 2011 Turing Medalist 9
10 Causal calculus, Counterfactuals, Do calculus, Transportability, Missing data models, Causal mediation, Graph mutilation, External validity, Causality in missingness,.
Customer Surveys Behaviors Recommendation Loyalty Loyalty Index Attitudes & Perceptions Experiences & Interactions Overall Satisfaction Perceptions Equipment and System Sales Support Technical Support Training Supplies and Orders Software Add On Solutions Customer Website Purchasing Support Contracts and Pricing System Installation 11
Customer Surveys Goals Goal 1. Goal 2. Goal 3. Goal 4. Goal 5. Goal 6. Goal 7. Goal 8. Goal 9. Goal 10. Decide where to launch improvement initiatives Highlight drivers of overall satisfaction Detect positive or negative trends in customer satisfaction Identify best practices by comparing products or marketing channels Determine strengths and weaknesses Set up improvement goals Design a balanced scorecard with customer inputs Communicate the results using graphics Assess the reliability of the questionnaire Improve the questionnaire for future use 12
Bayesian Network Analysis of Customer Surveys 13 Kenett, R.S. and Salini S. (2009). New Frontiers: Bayesian networks give insight into surveydata analysis, Quality Progress, pp. 31-36, August.
14
20% BOT12 Set up improvement goals 15
39% BOT12 Set up improvement goals 16
13% BOT12 Set up improvement goals 17
Robustness and Sensitivity Analysis of Bayesian Networks How are the factors in the model affecting the target variable levels? Which factors affect customer satisfaction and how How are factors affecting the distance between conditioned distributions and observed data? Which factors will have the largest impact if changed 18
Network learning Robustness of Network bnlearn: 1) Grow-Shrink (GS) algorithm, 2) Incremental Association (IAMB) algorithm, 3) Interleaved-IAMB (Inter-IAMB) algorithm, 4) Fast-IAMB (Fast-IAMB) algorithm, 5) Max-Min Parents and Children (MMPC) algorithm, 6) ARACNE and Chow-Liu algorithms, 7) Hill- Climbing (HC) greedy search algorithm, 8)Tabu Search (TABU) algorithm, 9) Max-Min Hill-Climbing (MMHC) algorithm and 10) two-stage Restricted Maximization (RSMAX2) algorithm for both discrete and Gaussian networks 19
20 Robustness Analysis
Network estimation Sensitivity to Driver Variables Generate a full factorial experiments based on driver combinations. Analyse the target variable thus generated (overall satisfaction) with respect to the observed one, in order to study the effect that each driver combinations has on its variability. 21
Sensitivity Analysis 50,0% 45,0% Target variable: Satisfaction 44,0% 40,0% 35,0% 30,0% 25,0% 26,4% 20,0% 15,0% 10,0% 5,0% 4,6% 9,7% 15,3% 0,0% 1 2 3 4 5 22
Sensitivity Analysis 1000 simulated target variable using the BN estimated on the observed dataset. Distribution of simulated target variable for each level of each variable Equipment Sales Technical Support Mean 2.995 3.005 3.015 Mean 2.995 3.005 3.015 Mean 2.995 3.005 3.015 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Impact of factors on target variable Suppliers Administrative Support Terms and Conditions Mean 23 2.995 3.005 3.015 1 2 3 4 5 Mean 2.995 3.005 3.015 1 2 3 4 5 Mean of simulate target variable for each level of each variable (ABC) Mean 2.995 3.005 3.015 1 2 3 4 5
Sensitivity Analysis Compare the observed distribution with the simulated one (1000 runs) using Kolmogorov-Smirnov test Equipment Sales Technical Support Mean of KS 0.256 0.260 0.264 Mean of KS 0.256 0.260 0.264 Mean of KS 0.256 0.260 0.264 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Impact of factors on conditioned target variable distribution Suppliers Administrative Support Terms and Conditions Mean of KS 24 0.256 0.260 0.264 Mean of KS 0.256 0.260 0.264 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Mean of Kolmogorov-Smirnov for each level of each variable (ABC) Mean of KS 0.256 0.260 0.264
Information Quality (InfoQ) of Integrated Analysis f(f 1,f 2, f n ) Models f 1 f 2 f 3 f 4. N f Goals g 1 4 g 2 2 g 3 2 g 4 1 N g 1 2 3 3 Kenett R.S. and Salini S. (2011). Modern Analysis of Customer Surveys: comparison of models and integrated 25 analysis, with discussion, Applied Stochastic Models in Business and Industry, 27, pp. 465 475
Goal BN CUB Rasch CC 1 Decide where to launch improvement initiatives 2 Highlight drivers of overall satisfaction 3 Detect positive or negative trends in customer satisfaction 4 Identify best practices by comparing products or marketing channels 5 Determine strengths and weaknesses 6 Set up improvement goals 7 Design a balanced scorecard with customer inputs 8 Communicate the results using graphics 9 Assess the reliability of the questionnaire 10 Improve the questionnaire for future use Kenett R.S. and Salini S. (2011). Modern Analysis of Customer Surveys: comparison of models and integrated 26 analysis, with discussion, Applied Stochastic Models in Business and Industry, 27, pp. 465 475
Goal 1: Identify causes of risks that materialized Goal 2: Design risk mitigation strategies Goal 3: Provide a risk management dashboard 27
28 Bayesian network of communication network data
Conditioned BN Experimental Array Robustness of Prediction 29
Conditioned BN Experimental Array Robustness of Prediction Classification error runs i 1 I ( ( x) s s ) i i 30 BAYESIAN NETWORKS: ROBUSTNESS AND SENSITIVITY ISSUES By Federica Cugnatay, Ron Kenetty,z and Silvia Salini http://users.unimi.it/salini/rsbn/data.zip http://users.unimi.it/salini/rsbn/rcodes.zip http://users.unimi.it/salini/rsbn/materials.html
Analysis: Main effect plots of mean GoF Main Effects Plot (data means) for GoF mean 61.0 Lines Extensions 60.5 60.0 Mean of GoF mean 59.5 61.0 60.5 60.0 59.5 1 1 2 2 3 Smart phones 3 4 5 4 6 1 2 3 Analyzing the mean GoF for each factor, the most robust levels are Lines=3, Smart Phone=3, and Extensions=3 31 4 5
Analysis: Main effect plots of mean GoF 5 4 Contour Plot of GoF mean vs Extensions, Lines GoF mean < 59.0 59.0-59.5 59.5-60.0 60.0-60.5 60.5-61.0 > 61.0 Extensions 3 Hold Values Smart phones 3.5 2 1 1.0 1.5 2.0 2.5 Lines 3.0 3.5 4.0 A central number of lines and a central level for extension is associated with a low mean GoF 32
Analysis: Main effect plots of mean GoF Smart phones Contour Plot of GoF mean vs Smart phones, Lines 6 5 4 3 GoF mean < 59 59-60 60-61 61-62 > 62 Hold Values Extensions 3 2 1 1.0 1.5 2.0 2.5 Lines 3.0 3.5 4.0 A limited number of lines and a limited level for smart 33 phone is associated with a low mean GoF
Analysis: Main effect plots of mean GoF Contour Plot of GoF mean vs Smart phones, Extensions Smart phones 6 5 4 3 GoF mean < 58.5 58.5-59.0 59.0-59.5 59.5-60.0 60.0-60.5 60.5-61.0 61.0-61.5 > 61.5 Hold Values Lines 2.5 2 1 1 2 3 Extensions 4 5 A limited level both for extension and smart phone is associated 34 with a low mean GoF
Knowledge Goals Information Quality InfoQ(f,X,g) = U(f(X g)) Information Quality Data Quality Primary Data Analysis Quality Secondary Data - Experimental - Experimental - Observational - Observational g X f U A specific analysis goal The available dataset An empirical analysis method A utility measure 1. Data resolution 2. Data structure 3. Data integration 4. Temporal relevance What How 5. Chronology of data and goal 6. Generalizability 7. Operationalization 8. Communication 35
Big Data Analytics InfoQ(f,X,g) = U(f(X g)) 1. Data resolution 2. Data structure 3. Data integration 4. Temporal relevance 5. Chronology of data and goal 6. Generalizability 7. Operationalization 8. Communication Russom, P., Big Data Analytics, TDWI Best Practices Report, Q4 2011 36
Knowledge 1. Data resolution 2. Data structure 3. Data integration 4. Temporal relevance 5. Chronology of data and goal 6. Generalizability 7. Construct operationalization 8. Communication Data Quality Information Quality Analysis Quality Goals Bayesian Network 37
Thank you for your attention What did we cover? Statistical thinking Bayesian networks Information Quality (InfoQ) Robustness of network Sensitivity of predictions