Ensemble Modeling with R
1 Doctoral Candidate / Merchandise Data Scientist
MatthewALanham.com
Virginia Tech, Department of Business Information Technology
Advance Auto Parts, Inc.
2 Outline
  My Background and Research
  Pros and Cons of R for Data Science
  Modeling Using CRISP-DM Framework
  What is Ensemble Modeling?
  Fitting Models
  Bagging a decision tree
  Optimal Decision Cut Points for Binary Classification
SEPTEMBER 15,
3 Background
  (2005) B.A. Economics/Mathematics, Indiana University-Bloomington
  ( ) Genscape, Inc., Louisville, KY. Energy transparency start-up
  ( ) M.S. Biostatistics-Decision Science, University of Louisville
  ( ) M.S. Statistics, Virginia Tech
  ( Current) Ph.D. Business Information Technology, Virginia Tech
  ( Current) Advance Auto Parts, Inc. Fortune 500 Retailer (#402.. for now)
Research Focus
How can we build better predictive models that are empirically sound (stats) as input parameters to prescriptive models that are process representative (optimization) to provide the best (maintainable, timely, scalable, KPI fused) decision-support for a retailer's assortment plan?
  Why is assortment planning so important?
  Why is the assortment planning problem so challenging?
  Where does Data Science & Big Data Analytics (BDA) come into play?
  Where does Information Technology (IT) come into play?
  Where does Business come into play?
4 Predictive and Prescriptive Analytics
INTEGRATING PREDICTIVE AND PRESCRIPTIVE ANALYTICS
[Figure: flow diagram. Data feed the predictive model(s) (market specs, demand model, preference structure, similarity measures, utility model), which supply demand forecast parameter(s) to the prescriptive model. The prescriptive model has decision variables 1) Assortment 2) Prices 3) Promotion 4) Shelf Space*, performance measures 1) Obj. Function - Revenue 2) Constraints - Budget(s), and decision criterion Max Profit. A search algorithm loops until the optimality conditions determine the optimal solution.]
Time summary performance measures (TSPM) and scenario summary performance measures (SSPM), e.g., sum or average: performance measures that are functions over a time horizon or random variables that must be summarized over their distributions.
5 Oracle + SPSS Modeler
WHAT I'M WORKING WITH CURRENTLY
My opinion: IBM SPSS Modeler and SAS Enterprise Miner are:
  1) Great for teaching
  2) Great for stand-alone data mining projects
  3) Visually appealing to Management
  4) Not great for real-time production analytics
  5) Not great for customized solutions
  6) Not designed for prescriptive analytics
[Figure: An example of an IBM SPSS Modeler stream building predictive models]
SEPTEMBER 17,
6 Data Mining, Data Science, and Predictive Modeling with R
DATA SCIENCE WITH R
R is an open-source, freely accessible language for statistical and mathematical computing, released under the GNU General Public License, version 2 (Ihaka & Gentleman, 1996).
R is compatible with many operating systems, such as Windows, Macintosh, Unix, and Linux.
According to Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, R is "the leading free, open-source software tool for PA (Predictive Analytics)" and "has a rapidly expanding base of users as well as enthusiastic volunteer developers who add to and support its functionalities" (Siegel, 2013).
Today there are several thousand user-developed packages (also referred to as libraries). Packages are collections of R functions, compiled code, and data put together in a specific format following CRAN's guidelines.
You can search for packages by application area here ( ). As of July 2014, there are 33 different application areas. The Machine Learning application area alone lists 72 packages, whose functions cover nearly any methodology. Many newer techniques are available here that are not available in commercial software packages.
Cons
  Memory, memory, memory! See memory_example.r
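The memory concern comes from R holding its working objects in RAM. The file memory_example.r is not part of this transcript, so the following is only a minimal sketch of the idea, with a made-up vector size, showing how to check what a single object costs.

```r
# Hypothetical illustration, not the speaker's memory_example.r.
# A numeric vector stores 8 bytes per element, all of it held in RAM.
x <- numeric(1e6)

# object.size() reports the estimated memory used by one object, in bytes.
size_mb <- as.numeric(object.size(x)) / 1024^2
cat("One million doubles use about", round(size_mb, 2), "MB\n")

# Removing the object and calling gc() lets R release the memory.
rm(x)
invisible(gc())
```

Scaling the same arithmetic to a wide data frame with millions of rows gives a quick estimate of whether a data set will fit on a given machine before it is loaded.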
7 Data Mining, Data Science, and Predictive Modeling with R
WHAT IS ENSEMBLE MODELING?
Ensemble methods train multiple predictive models and then combine their predictions to achieve higher overall performance and stability.
Pros
  Ensemble methods require little tuning
  Ensemble methods operate on a variety of input types (categorical variables, integers, and real numbers)
  Ensemble methods can be used on a variety of problems (binary and multi-class classification, rankings, regression, etc.)
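The "train several models, then combine" idea can be sketched in a few lines of base R. The data below are simulated and the two logistic models are deliberately weak; neither is from the talk. The only point is the combining step, here a plain average of predicted probabilities.

```r
# Simulated data (an assumption for illustration only).
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - 1.0 * x2))
dat <- data.frame(y, x1, x2)

# Two base models, each seeing only part of the signal.
m1 <- glm(y ~ x1, family = binomial, data = dat)
m2 <- glm(y ~ x2, family = binomial, data = dat)
p1 <- predict(m1, type = "response")
p2 <- predict(m2, type = "response")

# The ensemble prediction is the average of the two probabilities.
p_ens <- (p1 + p2) / 2

# Compare accuracy at a 0.50 cutoff (in-sample, so optimistic).
acc <- function(p, y) mean((p > 0.5) == (y == 1))
print(round(c(m1 = acc(p1, dat$y), m2 = acc(p2, dat$y),
              ensemble = acc(p_ens, dat$y)), 3))
```

Averaging is the simplest combiner; voting, weighting, and stacking follow the same pattern with a different final step.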
8 CRISP-DM
DATA MINING FRAMEWORK
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a general data mining process model that can be applied to solve any business problem.
There are other popular data mining and analytics process models, such as Sample-Explore-Modify-Model-Assess (SEMMA), but in my opinion CRISP-DM is more structured and detailed.
CRISP-DM was created and modified over time by leading practitioners and researchers in the data mining field and has been shown to lead to analytical results that align with business objectives.
The CRISP-DM process model and techniques primarily fall under the predictive analytics domain in business analytics, where the objective is to help organizations predict future events and proactively act upon such insights in a systematic fashion to drive better business outcomes (Provost & Fawcett, 2013). However, this process could be extended to prescriptive (i.e., optimization) analytics endeavors as well.
9 CRISP-DM
CRISP-DM DETAILED VIEW
[Figure: CRISP-DM Model Phases and Tasks (Source: Modified from )]
10 Business Understanding
BUSINESS UNDERSTANDING
Business Objectives
  Retail assortment planning, at the most basic level, asks which products to offer and how many (Mantrala et al., 2009).
  Assortment planning is one of the most important decisions faced by retailers (Sauré & Zeevi, 2013).
  Because of financial and physical capacity constraints, a retailer does not have the operational ability to stock, let alone hold in store, every possible product a consumer may desire (Sauré & Zeevi, 2013).
  You must get the project sponsor to detail the business success criteria. It is not some predictive model accuracy statistic. Examples:
    Increased sales of X% at Stores Y and Z.
    Reduced non-working inventory of W% at Stores Y and Z.
Assess Situation
  May use R and any of its available packages
  Deadline is September 17th at Meetup
  Competition winner gets $100, losers learn something, speaker gets feedback
Data Mining Goals
  Determine best overall test accuracy on a 10% out-of-training set
  To qualify, neither sensitivity nor specificity may fall below 0.70 on the out-of-training set.
Project Plan
  Lay out your expected work schedule, breaks, etc.
  Will vary depending on your experience using R
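The data mining goals translate directly into a scoring rule: rank entries by test accuracy, but only if sensitivity and specificity both reach 0.70. A minimal sketch of that check follows; the function name `qualify`, its arguments, and the toy vectors are hypothetical and not part of the competition code.

```r
# Hypothetical helper: score 0/1 predictions against 0/1 actuals.
qualify <- function(actual, predicted, min_rate = 0.70) {
  tp <- sum(actual == 1 & predicted == 1)  # true positives
  tn <- sum(actual == 0 & predicted == 0)  # true negatives
  fp <- sum(actual == 0 & predicted == 1)  # false positives
  fn <- sum(actual == 1 & predicted == 0)  # false negatives
  sens <- tp / (tp + fn)                   # share of actual 1s found
  spec <- tn / (tn + fp)                   # share of actual 0s found
  list(accuracy    = (tp + tn) / length(actual),
       sensitivity = sens,
       specificity = spec,
       qualifies   = sens >= min_rate && spec >= min_rate)
}

# Toy example: ten store-SKU records.
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)
res <- qualify(actual, predicted)
str(res)
```

In this toy case accuracy is 0.70 and sensitivity is 0.75, but specificity is 4/6, so the entry would not qualify even though its accuracy looks acceptable.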
11 Data Understanding
DATA UNDERSTANDING
Collect Data
Describe Data

Role (Y = target, X = predictor) | Variable | Description
---|---|---
  | store_number | A unique store identifier
  | sku_number | A unique SKU identifier
Y | SOLD | Whether the SKU sold in a respective store (1=yes, 0=no) in the last 13 periods after it was replenished/maxed.
  | NUM_SOLD | The number of realized unit SKU sales for a respective store over the past 1-13 periods.
X | NUM_SOLD_LAST | The number of realized unit SKU sales for a respective store over the past periods.
X | application_count | The total number of different year-make-model vehicle options that the respective SKU could be used for.
X | projected_growth_pct | The projected percentage growth for this SKU in the next 13 periods based on financial experts.
X | offset | For each store-sku, the positive deviation based on unit sales from the center of the part-type-specific distribution.
X | adjusted_offset | For each store-sku, the positive deviation based on unit sales from the center of the part-type-specific distribution, adjusted based on an ad-hoc calculation.
X | unit_sales_py | The total number of units sold for this particular SKU over all stores between the past 27 and 39 periods.
X | unit_sales_cy | The total number of units sold for this particular SKU over all stores between the past 14 and 26 periods.
X | unit_sales_fy | The total number of units sold for this particular SKU over all stores over the past 13 periods.
X | total_vio | The total number of "estimated" vehicles in operation associated to a particular store based on an ad-hoc calculation.
X | adjusted_total_vio | The total number of "estimated" vehicles in operation associated to a particular store based on an ad-hoc calculation.
X | vio_compared_to_cluster | The percentage of vehicles in operation (VIO) for a respective store compared to the total number of VIO for all stores associated to a cluster over the past 14 to 26 periods.
X | avg_cluster_cy_unit_sales | The average number of SKUs sold based on a clustering of all stores over the past 13 to 26 periods.
X | avg_cluster_cy_total_sales | The average number of total sales, which is a combination of unit and lost sales, based on store clusters over the past 14 to 26 periods.
X | avg_cluster_cy_lost_sales | The average number of lost sales, clustered by all stores over the past 14 to 26 periods.
X | pop_est_cy | Estimated number of persons in the population where the store is located based on the latest period.
X | pop_density_cy | Estimated density (a percentage) of the population where the store is located based on the latest period.
X | pct_white | Estimated number of Caucasian-identified persons where the store is located based on the latest period.
X | age | Estimated median person-age where the store is located based on the latest period.
X | pct_college | Estimated percentage of college-educated persons where the store is located based on the latest period.
X | pct_blue_collar | Estimated percentage of blue-collar type workers where the store is located based on the latest period.
X | median_household_income | Estimated median household income where the store is located based on the latest period.
X | establishments | Estimated number of physical locations where business is conducted or where services or industrial operations are performed where the store is located based on the latest period.
X | road_quality_index | A measure of the quality of the roads in the area where the store is located.

Usually you will create such a table yourself but make it more descriptive. The data scientist will ask the domain expert(s) questions such as:
  What are the variables' units of measure?
  Where does the data come from? When is it updated?
  How and why was clustering performed a particular way?
  How and why was a variable adjusted?
12 Data Understanding
DATA UNDERSTANDING (CONT.)
Explore Data
  Matt's source code: Find the main.r
Data Quality
    DataQualityReport(skus)
    DataQualityReportOverall(dataSetName=skus)
13 Data Preparation
DATA PREPARATION
Data Description
The variable table from slide 11 is repeated on this slide. The only difference is that unit_sales_fy no longer carries the X mark.
    ## keep only the records with no missing values, then re-run the quality report
    skus = skus[which(complete.cases(skus)),]
    DataQualityReportOverall(dataSetName=skus)
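Dropping incomplete records is a one-liner, but it is worth counting what is lost before doing it. The sketch below uses a made-up five-row data frame (not the competition data) to show `complete.cases()` together with a per-column count of missing values.

```r
# Toy data frame with two missing values (illustrative only).
toy <- data.frame(store_number = 1:5,
                  num_sold     = c(3, NA, 0, 7, 2),
                  age          = c(34, 41, NA, 29, 38))

# How many values are missing in each column?
print(colSums(is.na(toy)))

# Keep only the rows where every column is observed.
n_before     <- nrow(toy)
toy_complete <- toy[complete.cases(toy), ]
n_dropped    <- n_before - nrow(toy_complete)
cat("Dropped", n_dropped, "of", n_before, "rows\n")
```

If a large share of rows disappears, imputing or dropping a sparse column may be a better choice than listwise deletion.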
14 Modeling
MODELING
Modeling Techniques
  C5.0 Decision tree
  Logistic Regression
  CART Decision tree
15 Modeling
MODELING (CONT.)
Design
When building and testing predictive models using observational data (i.e., data that is not controlled as in laboratory experimentation), the question that must be answered is: how valid is my model with regard to what will happen next?
In a properly designed and controlled experiment, the data (samples) used in the experiment are used to make inferences about the population. Regardless of how large or small the sample is compared to the true size of the population, this single randomly selected subset of the population allows generalization to the remaining data not used in the study.
Cross-validation is the most practical and cost-effective means of obtaining a proxy for truth in predictive analytics.
    ## Randomly partition data into training and test sets
    my_seed =     # seed value not shown in the transcript
    skus = GenerateTTV(dataSetName=skus, response='sold', trainpct=.90, testpct=.10, my_seed)
    GeneratePartitionPcts(dataSetName=skus)
Using the training data error rate as a proxy for a model's generalization error is not wise, especially when the training error is low to almost perfect. Most likely the model has been overfit (or overtrained) and will not perform as well when new examples are fed through and evaluated from a validation data set (Zhou, 2012).
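GenerateTTV is the speaker's own helper and its source is not shown here. As a rough stand-in, the following sketch performs a 90/10 split that samples within each class so the target rate is preserved in both partitions. The function name `partition_train_test`, the `partition` column, the seed, and the toy data are all assumptions.

```r
# Hypothetical stand-in for a train/test partitioning helper.
partition_train_test <- function(data, response, train_pct = 0.90, seed = 1) {
  set.seed(seed)                              # reproducible split
  part <- rep("test", nrow(data))
  for (cls in unique(data[[response]])) {
    idx     <- which(data[[response]] == cls) # rows belonging to this class
    n_train <- round(train_pct * length(idx))
    part[idx[sample.int(length(idx), n_train)]] <- "train"
  }
  data$partition <- part
  data
}

# Toy data: 80 unsold and 20 sold records.
toy  <- data.frame(sold = rep(c(0, 1), times = c(80, 20)), x = rnorm(100))
toy2 <- partition_train_test(toy, response = "sold")

# Rows = class, columns = partition.
print(table(sold = toy2$sold, partition = toy2$partition))
```

Sampling within each class matters most when the positive class is rare, since a purely random 10% test set could otherwise end up with very few positives.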
16 Modeling
MODELING (CONT.)
Design
    ## Percentage of Target is 1 (or Y='SOLD') in total data set
    dim(skus[which(skus$sold==1),])[[1]] /
      dim(skus[which(skus$sold==0 | skus$sold==1),])[[1]]
    ## Percentage of Target is 1 (or Y='SOLD') in training set
    dim(skus[which(skus$sold==1 & skus$spss_partition=='train'),])[[1]] /
      dim(skus[which((skus$sold==1 | skus$sold==0) & skus$spss_partition=='train'),])[[1]]
    ## Percentage of Target is 1 (or Y='SOLD') in test set
    dim(skus[which(skus$sold==1 & skus$spss_partition=='test'),])[[1]] /
      dim(skus[which((skus$sold==1 | skus$sold==0) & skus$spss_partition=='test'),])[[1]]
    ## remove independent variables that you don't want to use
    names(skus)
    skus2 = skus[,c(3,5:19,21:28)]
    head(skus2)
    ## the classifiers need the target as a factor
    skus2$sold = as.factor(skus2$sold)
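The three target-rate checks above can also be written more compactly. The sketch below is one possible alternative on a made-up ten-row data frame that reuses the `sold` and `spss_partition` column names; it is not the speaker's code.

```r
# Toy data reusing the column names from the slide (values are invented).
toy <- data.frame(sold           = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
                  spss_partition = c(rep("train", 8), rep("test", 2)))

# Share of records with sold == 1 in the whole data set.
overall <- mean(toy$sold == 1)

# The same share computed separately for each partition.
by_part <- tapply(toy$sold == 1, toy$spss_partition, mean)

print(overall)
print(by_part)
```

If the per-partition rates differ noticeably from the overall rate, as they do in this tiny example, the split is worth redoing with sampling inside each class.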
17 Modeling
MODELING (CONT.)
    ## set up data for algorithms
    ## predictors are columns 2 through the next-to-last column of skus2
    trainx = skus2[which(skus2$spss_partition=='train'),2:(length(skus2)-1)]
    trainy = skus2[which(skus2$spss_partition=='train'),'sold']
    train = cbind(trainy,trainx)
    testx = skus2[which(skus2$spss_partition=='test'),2:(length(skus2)-1)]
    testy = skus2[which(skus2$spss_partition=='test'),'sold']
    test = cbind(testy,testx)
18 Modeling
MODELING (CONT.)
Build Models: C5.0 Decision tree
    require(C50)  # fit classification tree models or rule-based models using Quinlan's C5.0 algorithm
    C5Params = C5.0Control( )
    C5 = C5.0(x=trainx, y=trainy
             ,control=C5Params  # control parameters defined above
             ,trials=1          # number of boosting iterations; 1 implies a single model is used
             )
    summary(C5)
Changing the trials from 1 to 1000 doesn't change the result in this case.
Callouts on the summary(C5) output:
  Overall error rate
  The confusion matrix is based on a decision cutoff threshold of 0.50
  Variables that were used to create the tree
19 Modeling
MODELING (CONT.)
Build Models: C5.0 Decision tree
    ## type is either "class" for the predicted class or "prob" for model confidence values
    ## training probabilities and predicted classes
    C5trainp = predict(C5, newdata = trainx, trials = C5$trials["Actual"],
                       type = "prob", na.action = na.pass)[,2]
    C5trainc = predict(C5, newdata = trainx, trials = C5$trials["Actual"],
                       type = "class", na.action = na.pass)
    ## testing probabilities and predicted classes
    C5testp = predict(C5, newdata = testx, trials = C5$trials["Actual"],
                      type = "prob", na.action = na.pass)[,2]
    C5testc = predict(C5, newdata = testx, trials = C5$trials["Actual"],
                      type = "class", na.action = na.pass)
20 Modeling
MODELING (CONT.)
Build Models: Logistic Regression
    logit.fit = glm(trainy ~ NUM_SOLD_LAST + TOTAL_VIO + ADJ_TOTAL_VIO
                    + VIO_COMPARED_TO_CLUSTER + POP_EST_CY + POP_DENSITY_CY
                    + PCT_WHITE + AGE + PCT_COLLEGE + PCT_BLUE_COLLAR
                    + MEDIAN_HOUSEHOLD_INCOME + ESTABLISHMENTS + ROAD_QUALITY_INDEX
                    + APPLICATION_COUNT + PROJECTED_GROWTH_PCT
                    + UNIT_SALES_CY + UNIT_SALES_PY + OFFSET + ADJUSTED_OFFSET
                    + AVG_CLUSTER_CY_UNIT_SALES + AVG_CLUSTER_CY_TOTAL_SALES
                    #+ AVG_CLUSTER_CY_LOST_SALES
                    ,family = binomial
                    ,data = train)
    summary(logit.fit)
21 Modeling
MODELING (CONT.)
Build Models: CART
    require(rpart)
    set.seed( )   # seed value not shown in the transcript
    ## grow a deliberately large tree (cp = 0) with equal class priors
    tree = rpart(trainy ~ ., data = train, method = "class",
                 cp = 0, minsplit = 4, minbucket = 2,
                 parms = list(prior=c(0.5, 0.5)))
    #summary(tree) <- this will take a while
    ## find the best pruned tree: the cp whose cross-validated error is
    ## closest to the minimum error plus one standard error
    i.min = which.min(tree$cptable[,"xerror"])
    i.se = which.min(abs(tree$cptable[,"xerror"] -
             (tree$cptable[i.min,"xerror"] + tree$cptable[i.min,"xstd"])))
    alpha.best = tree$cptable[i.se, "CP"]
    tree.p = prune(tree, cp=alpha.best)
    ## obtain predictions (probability of class 1, then class at a 0.5 cutoff)
    treetrainp = predict(tree.p, train)[,2]
    treetrainc = treetrainp
    treetrainc[which(treetrainc>.5)] = 1
    treetrainc[which(treetrainc<=.5)] = 0
    treetestp = predict(tree.p, test)[,2]
    treetestc = treetestp
    treetestc[which(treetestc>.5)] = 1
    treetestc[which(treetestc<=.5)] = 0
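The pruning step can be tried end to end on the `kyphosis` data set that ships with rpart. This sketch uses the common form of the one-standard-error rule (the simplest tree whose cross-validated error is within one standard error of the minimum), which is a slightly different selection than the "closest to" rule in the slide code. The seed and the data set are assumptions for illustration.

```r
library(rpart)

set.seed(123)  # cross-validation inside rpart is random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0, minsplit = 4, minbucket = 2)

# cptable rows run from the smallest tree (largest CP) to the largest tree.
cp_tab    <- fit$cptable
i_min     <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]

# First (simplest) row whose cross-validated error is under the threshold.
i_1se  <- min(which(cp_tab[, "xerror"] <= threshold))
pruned <- prune(fit, cp = cp_tab[i_1se, "CP"])

cat("Nodes before pruning:", nrow(fit$frame),
    " after pruning:", nrow(pruned$frame), "\n")
```

Calling `printcp(fit)` or `plotcp(fit)` shows the same table and makes it easier to see where the error curve flattens out.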
22 Modeling
ENSEMBLING VIA BAGGING
Bootstrap aggregation (Bagging)
Bagging is a simple way to increase the predictive power of a model.
Pros
  Most useful when the predictors are unstable: the more variation observed across the fitted models, the more bagging helps
Cons
  Using smaller samples yields more instability, and samples that are too small yield poor models
How
  Take several random samples with replacement from the training data set
  Use each sample to construct a separate predictive model and generate predictions for the testing data set
  Average the predictions to come up with one final predicted value
    my_list = getbs_samples(seed=my_seed, Ntrees=100, SampleSize=1000)
    bagged_tree = BS_Trees(response=train[,1], datasetname=train, samplelist=my_list, Ntrees=100)
    bagged_probs = getfinalpredictions(bagged_tree, datasetname=train, Ntrees=100)
What R packages are available for bagging?
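The three "How" steps can be sketched without the speaker's helper functions, whose source is not shown in this transcript. The version below bags rpart trees on the `kyphosis` data set that ships with rpart; the number of trees, the seed, and the data set are assumptions.

```r
library(rpart)

set.seed(2014)
n_trees     <- 25
sample_size <- nrow(kyphosis)

# Steps 1 and 2: draw a bootstrap sample and fit one tree per sample.
bag <- lapply(seq_len(n_trees), function(b) {
  idx <- sample.int(nrow(kyphosis), sample_size, replace = TRUE)
  rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[idx, ],
        method = "class")
})

# Step 3: average the predicted probabilities of class "present".
prob_matrix  <- sapply(bag, function(tree) {
  predict(tree, newdata = kyphosis, type = "prob")[, "present"]
})
bagged_probs <- rowMeans(prob_matrix)
bagged_class <- ifelse(bagged_probs > 0.5, "present", "absent")

print(table(actual = kyphosis$Kyphosis, predicted = bagged_class))
```

In this sketch the bagged model is simply the list `bag`, so one way to store it is `saveRDS(bag, "bag.rds")` and to score new data by repeating the averaging step. The predictions here are made on the training rows, so the table is optimistic; a held-out test set gives a fairer picture.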
23 Modeling
MODEL ASSESSMENT - TRAINING
[Figure panels: C5.0, Bagged CART, Logit, CART]
Why does the bagged tree perform worse?
24 Modeling
MODEL ASSESSMENT - TESTING
[Figure panels: C5.0, Bagged CART, Logit, CART]
How do we store a Bagged Model in R?
25 Modeling
OPTIMAL DECISION CUTPOINTS
Why use a decision cutoff threshold of 0.50?
[Figure: Example of an ROC curve]
26 Modeling
OPTIMAL DECISION CUTPOINTS
See
    require(OptimalCutpoints)
    ## Define my methods list
    methodlist = list(
       "Youden"            #1 (Youden Index);
      ,"ROC01"             #2 (minimizes distance between ROC plot and point (0,1));
      ,"PROC01"            #3 (minimizes distance between PROC plot and point (0,1));
      #,"MaxAccuracyArea"  #4 (maximizes Accuracy Area);
      #,"AUC"              #5 (maximizes concordance which is a function of AUC);
      ,"MaxEfficiency"     #6 (maximizes Efficiency or Accuracy);
      ,"MaxKappa"          #7 (maximizes Kappa Index);
      #,"MinErrorRate"     #8 (minimizes Error Rate);
      )
[Figure: C5.0 Training using the Youden cutoff method]
[Figure: Using other cutoff methods..]
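To make the Youden criterion concrete, here is a base-R sketch that scans every observed probability as a candidate cutoff and keeps the one maximizing sensitivity + specificity - 1. It does not use the OptimalCutpoints package; the function name `youden_cutoff` and the eight toy scores are hypothetical.

```r
# Hypothetical helper: pick the cutoff with the largest Youden index.
youden_cutoff <- function(prob, actual) {
  cutoffs <- sort(unique(prob))
  j <- sapply(cutoffs, function(cut) {
    pred <- as.integer(prob >= cut)          # classify at this cutoff
    sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)
    spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)
    sens + spec - 1                          # Youden index J
  })
  best <- which.max(j)                       # first maximum on ties
  list(cutoff = cutoffs[best], youden = j[best])
}

# Toy scores: four unsold (0) and four sold (1) records.
prob   <- c(0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90)
actual <- c(0,    0,    0,    1,    0,    1,    1,    1)

res <- youden_cutoff(prob, actual)
str(res)
```

On these toy scores the selected cutoff is 0.35 with J = 0.75, noticeably below the default 0.50. Cutoffs 0.35 and 0.70 tie on J, and the sketch keeps the lower one, which favors sensitivity over specificity.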
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationHow To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
More informationData Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
More informationIBM SPSS Modeler Professional
IBM SPSS Modeler Professional Make better decisions through predictive intelligence Highlights Create more effective strategies by evaluating trends and likely outcomes. Easily access, prepare and model
More informationA Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
More informationCOPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments
Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for
More informationHow To Choose A Churn Prediction
Assessing classification methods for churn prediction by composite indicators M. Clemente*, V. Giner-Bosch, S. San Matías Department of Applied Statistics, Operations Research and Quality, Universitat
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationCRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining
Mining Process CRISP - DM Cross-Industry Standard Process for Mining (CRISP-DM) European Community funded effort to develop framework for data mining tasks Goals: Cross-Industry Standard Process for Mining
More informationNumerical Algorithms Group. Embedded Analytics. A cure for the common code. www.nag.com. Results Matter. Trust NAG.
Embedded Analytics A cure for the common code www.nag.com Results Matter. Trust NAG. Executive Summary How much information is there in your data? How much is hidden from you, because you don t have access
More informationIndian School of Business Forecasting Sales for Dairy Products
Indian School of Business Forecasting Sales for Dairy Products Contents EXECUTIVE SUMMARY... 3 Data Analysis... 3 Forecast Horizon:... 4 Forecasting Models:... 4 Fresh milk - AmulTaaza (500 ml)... 4 Dahi/
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationDEMYSTIFYING BIG DATA. What it is, what it isn t, and what it can do for you.
DEMYSTIFYING BIG DATA What it is, what it isn t, and what it can do for you. JAMES LUCK BIO James Luck is a Data Scientist with AT&T Consulting. He has 25+ years of experience in data analytics, in addition
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationEnhancing Compliance with Predictive Analytics
Enhancing Compliance with Predictive Analytics FTA 2007 Revenue Estimation and Research Conference Reid Linn Tennessee Department of Revenue reid.linn@state.tn.us Sifting through a Gold Mine of Tax Data
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationThe Predictive Data Mining Revolution in Scorecards:
January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms
More informationPredictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD
Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,
More informationService courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
More informationGrow Revenues and Reduce Risk with Powerful Analytics Software
Grow Revenues and Reduce Risk with Powerful Analytics Software Overview Gaining knowledge through data selection, data exploration, model creation and predictive action is the key to increasing revenues,
More informationThe Prophecy-Prototype of Prediction modeling tool
The Prophecy-Prototype of Prediction modeling tool Ms. Ashwini Dalvi 1, Ms. Dhvni K.Shah 2, Ms. Rujul B.Desai 3, Ms. Shraddha M.Vora 4, Mr. Vaibhav G.Tailor 5 Department of Information Technology, Mumbai
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationMake Better Decisions Through Predictive Intelligence
IBM SPSS Modeler Professional Make Better Decisions Through Predictive Intelligence Highlights Easily access, prepare and model structured data with this intuitive, visual data mining workbench Rapidly
More informationFrom Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data
100 001 010 111 From Raw Data to 10011100 Actionable Insights with 00100111 MATLAB Analytics 01011100 11100001 1 Access and Explore Data For scientists the problem is not a lack of available but a deluge.
More informationKnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE
POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE Most Effective Modeling Application Designed to Address Business Challenges Applying a predictive strategy to reach a desired business
More informationDynamic Predictive Modeling in Claims Management - Is it a Game Changer?
Dynamic Predictive Modeling in Claims Management - Is it a Game Changer? Anil Joshi Alan Josefsek Bob Mattison Anil Joshi is the President and CEO of AnalyticsPlus, Inc. (www.analyticsplus.com)- a Chicago
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationData Project Extract Big Data Analytics course. Toulouse Business School London 2015
Data Project Extract Big Data Analytics course Toulouse Business School London 2015 How do you analyse data? Project are often a flop: Need a problem, a business problem to solve. Start with a small well-defined
More information!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
More informationBIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
More informationUsing Data Mining to Detect Insurance Fraud
IBM SPSS Modeler Using Data Mining to Detect Insurance Fraud Improve accuracy and minimize loss Highlights: combines powerful analytical techniques with existing fraud detection and prevention efforts
More informationSoftware for Supply Chain Design and Analysis
Software for Supply Chain Design and Analysis Optimize networks Improve product flow Position inventory Simulate service Balance production Refine routes The Leading Supply Chain Design and Analysis Application
More informationUsing Data Mining to Detect Insurance Fraud
IBM SPSS Modeler Using Data Mining to Detect Insurance Fraud Improve accuracy and minimize loss Highlights: Combine powerful analytical techniques with existing fraud detection and prevention efforts Build
More informationDiscovering, Not Finding. Practical Data Mining for Practitioners: Level II. Advanced Data Mining for Researchers : Level III
www.cognitro.com/training Predicitve DATA EMPOWERING DECISIONS Data Mining & Predicitve Training (DMPA) is a set of multi-level intensive courses and workshops developed by Cognitro team. it is designed
More informationFinancial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms
Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms Johan Perols Assistant Professor University of San Diego, San Diego, CA 92110 jperols@sandiego.edu April
More informationPrediction of Stock Performance Using Analytical Techniques
136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University
More informationStatement of Work. Shin Woong Sung
Statement of Work Shin Woong Sung 1. Executive Summary This Statement of Work (SOW) suggests a plan and a solution approach to find out the best mix of machines for each casino site of Lucky Duck Entertainment,
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
More informationData Science with R. Introducing Data Mining with Rattle and R. Graham.Williams@togaware.com
http: // togaware. com Copyright 2013, Graham.Williams@togaware.com 1/35 Data Science with R Introducing Data Mining with Rattle and R Graham.Williams@togaware.com Senior Director and Chief Data Miner,
More informationExecutive Briefing White Paper Plant Performance Predictive Analytics
Executive Briefing White Paper Plant Performance Predictive Analytics A Data Mining Based Approach Abstract The data mining buzzword has been floating around the process industries offices and control
More informationInsurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationOverview. Background. Data Mining Analytics for Business Intelligence and Decision Support
Mining Analytics for Business Intelligence and Decision Support Chid Apte, PhD Manager, Abstraction Research Group IBM TJ Watson Research Center apte@us.ibm.com http://www.research.ibm.com/dar Overview
More informationStrengthening Diverse Retail Business Processes with Forecasting: Practical Application of Forecasting Across the Retail Enterprise
Paper SAS1833-2015 Strengthening Diverse Retail Business Processes with Forecasting: Practical Application of Forecasting Across the Retail Enterprise Alex Chien, Beth Cubbage, Wanda Shive, SAS Institute
More informationTable of Contents. June 2010
June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and
More informationVariable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal Bank of Scotland, Bridgeport, CT
Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal ank of Scotland, ridgeport, CT ASTRACT The credit card industry is particular in its need for a wide variety
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationAnalytics in Action. What do Jeopardy, Pampers, and Major League Baseball all have in common? October 24, 2012
Analytics in Action What do Jeopardy, Pampers, and Major League Baseball all have in common? October 24, 2012 University of Cincinnati Tangeman University Center Theater Sponsored by LUCRUM, Inc. ABOUT
More informationMarketing Strategies for Retail Customers Based on Predictive Behavior Models
Marketing Strategies for Retail Customers Based on Predictive Behavior Models Glenn Hofmann HSBC Salford Systems Data Mining 2005 New York, March 28 30 0 Objectives Inform about effective approach to direct
More informationTrusted Experts in Business Analytics BUSINESS ANALYTICS FOR DEMAND PLANNING: HOW TO FORECAST STORE/SKU DEMAND
Trusted Experts in Business Analytics BUSINESS ANALYTICS FOR DEMAND PLANNING: HOW TO FORECAST STORE/SKU DEMAND September 2014 HOW DOES TM1 AND SPSS MODELER INTEGRATION WORK? In a QueBIT whitepaper titled
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
More informationPrescriptive Analytics. A business guide
Prescriptive Analytics A business guide May 2014 Contents 3 The Business Value of Prescriptive Analytics 4 What is Prescriptive Analytics? 6 Prescriptive Analytics Methods 7 Integration 8 Business Applications
More informationAPPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING
Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology
More informationData Mining Introduction
Data Mining Introduction Bob Stine Dept of Statistics, School University of Pennsylvania www-stat.wharton.upenn.edu/~stine What is data mining? An insult? Predictive modeling Large, wide data sets, often
More information