Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015
Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience; SQL Server MVP, MCT; 13 books; 7+ courses. Focus: data modeling, data mining, data quality
Agenda Introduction Naïve Bayes Decision Trees Neural Network Logistic Regression Predictive Models Evaluation
Data Mining Algorithms Data mining, as the most advanced data analysis technique, is gaining popularity. With modern data mining engines, products, and packages, like SQL Server Analysis Services (SSAS), Excel, and R, data mining has become a black box: it is possible to use data mining without knowing how it works. But not knowing how the algorithms work can lead to many problems, including using the wrong algorithm for a task, misinterpreting the results, and more. Learn how the most popular data mining algorithms work, when to use which algorithm, and the advantages and drawbacks of each algorithm. Use the algorithms in SSAS, Excel, and R
Assumptions This is not an introduction to SQL Server, Excel, or R, nor to tools like Visual Studio, SQL Server Management Studio, Excel, or RStudio. Basic familiarity with at least one of the tools is assumed. The focus is on the algorithms
What Is Data Mining? Michael J. A. Berry and Gordon S. Linoff: Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover patterns and rules Ralph Kimball: Data mining is a collection of powerful analysis techniques for making sense out of very large datasets Bill Inmon: Data Mining / Data Exploration is the usage of historical data to discover and exploit important business relationships
What Is Data Mining? Deduce knowledge by examining data, and then make predictions based on the knowledge extracted. Examining data means scanning samples of known facts about cases using their attributes, which are called variables. Knowledge takes the form of patterns: clusters, decision trees, neural networks, association rules. On-Line Analytical Processing (OLAP) is model driven, whereas data mining is data driven. Alternative names include knowledge discovery in databases (KDD) and predictive analytics
The Two Types of Data Mining Directed (supervised) data mining (top-down approach): classification, estimation, forecasting. Undirected (unsupervised) data mining (bottom-up approach): affinity grouping, clustering, description
Typical Business Questions What's the credit risk of this customer? Are there any groups of my customers? What products do customers tend to buy together? How much of a specific product can I sell in the next time period? What is the potential number of customers shopping in this store? What are the major groups of my web-click customers? Is this a spam email?
Data Mining Tasks Cross-selling (market basket analysis); the order of items in a purchase might also be of some interest. Fraud detection. Churn detection. Customer segmentation. How a website is used. Forecasting
Data Mining Virtuous Cycle (diagram): Identify → Transform → Measure → Act
The CRISP Model (diagram) CRISP = Cross Industry Standard Process for Data Mining (http://en.wikipedia.org/wiki/cross_industry_standard_process_for_data_mining)
Data Mining Data Flow (diagram): LOB apps feed a historical dataset through ETL; the historical dataset feeds mining models and a cube; the mining models support model browsing, reports, and prediction against a new dataset
Different Types of Analyses Structured reports: some interaction, but not dynamic restructuring; can enable ad-hoc reports with a semantic model. Structured groupings in OLAP: predefined grouping buckets; report structure is dynamic. Structured attributes with data mining: predefined attributes; the mining model calculates grouping and structure
SQL Server Tools SQL Server Analysis Services (SSAS) installed in Multidimensional and Data Mining mode SQL Server Integration Services (SSIS) Full-text search and semantic search
Excel Tools Microsoft Office Data Mining Add-ins Excel does not become a data mining engine Needs connection to SSAS in multidimensional mode Excel cell range or Excel table can be the data source Three add-ins: Data Mining Client for Excel Table Analysis Tools for Excel Data Mining Templates for Visio
Introducing R R is a free programming language and software environment for statistical computing and graphics Free under the GNU General Public License Pre-compiled binary versions are provided for various operating systems R uses a command line interface; however, several graphical user interfaces are available for use with R RStudio is a free and open source integrated development environment (IDE) for R
Naïve Bayes Naive Bayes quickly builds mining models that can be used for classification and prediction. This makes it a good option for exploring the data. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute. The probabilities can later be used to predict an outcome of the predictable attribute based on the known input attributes. Input attributes are treated as mutually independent
Naïve Bayes Example tree (diagram): 70% of parts are actually OK and 30% are actually faulty. Of the faulty parts, 90% are judged faulty (.30 × .90 = .27 of all parts) and 10% are judged OK (.03); of the OK parts, 20% are judged faulty (.14) and 80% are judged OK (.56). Overall, .41 of parts are judged faulty and .59 are judged OK
Naïve Bayes Reverse tree (diagram): of the parts judged faulty (.41), .67 are actually faulty and .33 are actually OK; of the parts judged OK (.59), .95 are actually OK and .05 are actually faulty. After classification, the posterior probabilities are much more accurate than the prior probabilities: declared = prior, actual = posterior probabilities
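The reverse tree is just Bayes' theorem applied to the forward tree. A minimal sketch in R that reproduces the slide's numbers (the variable names are illustrative):

```r
# Prior (declared) probabilities of the actual state
p_faulty <- 0.30
p_ok     <- 0.70
# Conditional probabilities of being judged faulty
p_jf_given_faulty <- 0.90
p_jf_given_ok     <- 0.20
# Total probability of being judged faulty: .27 + .14 = .41
p_jf <- p_faulty * p_jf_given_faulty + p_ok * p_jf_given_ok
# Posterior: P(actually faulty | judged faulty) = .27 / .41
p_faulty_given_jf <- p_faulty * p_jf_given_faulty / p_jf
# Posterior: P(actually faulty | judged OK) = .03 / .59
p_faulty_given_jok <- p_faulty * (1 - p_jf_given_faulty) / (1 - p_jf)
```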
Example A table of products with Color, Class, and Weight columns. Where Weight is missing, Color is also missing in 80% of cases and Class in 60%; where Weight is present, Color is missing in 20% of cases and Class in 40%. Likelihood that Weight is missing: 0.8 (Color missing given Weight missing) × 0.6 (Class missing given Weight missing) = 0.48. Likelihood that Weight is not missing: 0.2 (Color missing given Weight not missing) × 0.4 (Class missing given Weight not missing) = 0.08. When Color and Class are unknown, the likelihood that Weight is missing is therefore much higher than the likelihood that it is not missing
Example You can convert the likelihoods to probabilities by normalizing their sum to 1: P (Weight missing if Color and Class are missing) = 0.48 / (0.48 + 0.08) = 0.857 P (Weight not missing if Color and Class are missing) = 0.08 / (0.48 + 0.08) = 0.143
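The same arithmetic as a minimal R sketch of the naive (independence-based) computation from the slide:

```r
# Likelihoods under the naive independence assumption
l_missing     <- 0.8 * 0.6   # Weight missing:     0.48
l_not_missing <- 0.2 * 0.4   # Weight not missing: 0.08
# Normalize the likelihoods so they sum to 1
p_missing     <- l_missing / (l_missing + l_not_missing)       # 0.857
p_not_missing <- l_not_missing / (l_missing + l_not_missing)   # 0.143
```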
Naïve Bayes Usage Naive Bayes is used for classification Assign new cases to predefined classes Typical usage scenarios include: Categorizing bank loan applications Assigning customers to predefined segments Quickly obtaining a basic comprehension of the data by checking the correlation between input variables
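In R, a Naive Bayes classifier can be built in a few lines. A minimal sketch, assuming the e1071 package is installed; the built-in iris data stands in for real business data:

```r
library(e1071)

# Build the model: per-class probabilities for each input attribute
model <- naiveBayes(Species ~ ., data = iris)

# Classify a new case and inspect the per-class probabilities
new_case <- iris[1, -5]
predict(model, new_case)                 # predicted class
predict(model, new_case, type = "raw")   # probability of each class
```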
Decision Trees Decision Trees assign (classify) each case to one of a few (discrete) broad categories of a selected attribute (variable) and explain the classification with a few selected input variables. Once built, they are easy to understand. They are used to predict values of the explained variable. Recursive partitioning is used to build the tree: data is split into partitions, and the partitions are then split up further. Initially, all cases are in one big box
Decision Trees The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions the data into the purest classes of the target variable. It uses several measures of purity, such as frequency distribution, entropy, and Bayesian scoring of prior/posterior probabilities. It then repeats the splitting process for each new class, again testing all possible breaks. The problem is where to stop
Decision Trees A common problem is over-fitting. Branches of the tree that are not useful can be pre-pruned or post-pruned. Pre-pruning methods try to stunt the growth of the tree before it grows too deep; they test each node to see whether a further split would be useful, and the tests can be simple (number of cases) or complicated (complexity penalty). Post-pruning methods allow the tree to grow and then prune off branches; again, the test can be simple (number of cases) or more complex
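A minimal sketch of both approaches in R with the rpart package (the control parameters and the iris data are illustrative):

```r
library(rpart)

# Pre-pruning: stunt growth with a minimum node size (a simple test)
# and a complexity penalty (a complicated test)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20, cp = 0.01))

# Post-pruning: let the tree grow deep, then cut branches back
# using the complexity value with the lowest cross-validated error
big   <- rpart(Species ~ ., data = iris, method = "class",
               control = rpart.control(minsplit = 2, cp = 0.0))
best  <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
small <- prune(big, cp = best)
```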
Example An interview of people who watched the famous Woodstock movie. You have a population aged between 20 and 60 years. You gathered data about their education and grouped it into 7 classes (1 = lowest, 7 = highest). 55% of them liked the movie. Can you discover the factors that influence whether they liked the movie?
Example (figures): the population is plotted by age in years (20-60) and education level (1-7). Initially, the whole population is a single group: 55% liked the movie, 45% did not. The first split, on AGE at 35, yields two purer groups: age 35+ (73% liked, 27% did not) and age under 35 (33% liked, 67% did not). A second split, on EDUCATION, refines the groups further: in the 35+ group, education level 2+ gives 87% liked vs. 33% for level below 2; in the under-35 group, education level 5+ gives 67% liked vs. 17% for level below 5
Decision Trees Usage Decision Trees are used for classification and prediction. Typical usage scenarios include: predicting which customers will leave; targeting the audience for mailings and promotional campaigns; explaining the reasons for a decision; answering questions such as "What movies do young female customers buy?"
Neural Network A neural network is a data modeling tool that can capture and represent complex input/output relationships. Neural networks resemble the human brain in the following two ways: they acquire knowledge through learning, and their knowledge is stored within inter-neuron connection strengths known as synaptic weights. The Neural Network algorithm explores more possible data relationships than the other algorithms
Neural Network (diagram): input units feed a hidden layer, which feeds the output. Each unit computes a weighted sum of its inputs and passes it through a non-linear function: a hyperbolic tangent function in the hidden layer and a sigmoid function in the output layer
Backpropagation Training a neural network is the process of setting the best weights on the inputs of each of the units. The backpropagation process: gets a training example and calculates the outputs; calculates the error, the difference between the calculated and the expected (known) result; adjusts the weights to minimize the error
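A minimal sketch of one backpropagation step in R for a tiny network matching the diagram (tanh hidden layer, sigmoid output); the layer sizes, the training example, and the squared-error loss are illustrative assumptions:

```r
set.seed(42)
x <- c(0.5, -1.2)                     # one training example, two inputs
y <- 1                                # expected (known) result
W1 <- matrix(rnorm(2 * 3), nrow = 3)  # hidden weights: 3 units x 2 inputs
W2 <- rnorm(3)                        # output weights
lr <- 0.1                             # learning rate

sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: weighted sums through the non-linear functions
h <- tanh(W1 %*% x)                   # hidden activations (tanh units)
o <- sigmoid(sum(W2 * h))             # output prediction (sigmoid unit)

# Backward pass: error and gradients
err   <- o - y                        # difference from the expected result
d_out <- err * o * (1 - o)            # through the sigmoid derivative
d_hid <- (W2 * d_out) * (1 - h^2)     # through the tanh derivative

# Adjust the weights to reduce the error
W2 <- W2 - lr * d_out * as.vector(h)
W1 <- W1 - lr * d_hid %*% t(x)
```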
Logistic Regression The sigmoid function is also called the logistic function. If a neural network has only input neurons that are directly connected to the output neurons (no hidden layer), it is logistic regression. f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); g(x) = σ(x) = 1 / (1 + e^(−x))
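In R, logistic regression is built in via glm(). A minimal sketch on an illustrative data set (the column names and values are made up):

```r
# Hypothetical loan data: predict risk from age
loans <- data.frame(
  age  = c(23, 29, 35, 38, 41, 47, 52, 60),
  risk = factor(c("low", "low", "high", "low", "high", "high", "low", "high"),
                levels = c("low", "high"))   # "high" is the modeled outcome
)
fit <- glm(risk ~ age, data = loans, family = binomial)

# Predicted probability of high risk for a 45-year-old applicant
predict(fit, newdata = data.frame(age = 45), type = "response")
```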
Logistic Regression and Neural Network Usage Like the Decision Trees algorithm, you can use the Neural Network and Logistic Regression algorithms for classification and prediction, e.g., risk analysis. Interpretation is more complex, especially for neural networks with many hidden layers; the Decision Trees algorithm is therefore more popular
Evaluating Predictive Models Lift chart Profit chart Classification matrix Cross validation
Training and Test Sets For predictive models, you need to split the data into training and test sets in order to evaluate the models A training set is required to build the model (70% of the data) A test set is used for predictions (30% of the data) When you know the value of the predicted variable, you can measure the quality of the predictions As with every sampling, it is important to randomly select the data for each set
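A minimal sketch of the random 70/30 split in R (iris again stands in for real data):

```r
set.seed(123)                                  # reproducible random sampling
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]                           # 70%: build the model
test  <- iris[-idx, ]                          # 30%: test the predictions
```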
Lift Chart With no target value, the lift chart shows the overall performance of the models; with a target value, it shows the percentage of the target audience reached within a specified percentage of the complete audience
Profit Chart Y = profit; X = percentage of the population contacted. Settings: population, fixed cost, individual cost, revenue per individual
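A minimal R sketch of both charts; the scores, outcomes, and cost/revenue settings are illustrative:

```r
set.seed(1)
score  <- runif(1000)                    # model's predicted probabilities
actual <- rbinom(1000, 1, score)         # known outcomes (1 = target)

ord  <- order(score, decreasing = TRUE)  # contact the best prospects first
hits <- cumsum(actual[ord])              # targets reached so far
pop  <- seq_along(hits) / length(hits)   # % of population contacted
gain <- hits / sum(actual)               # % of target audience reached

# Lift chart: model vs. the random-guess diagonal
plot(pop, gain, type = "l", xlab = "% contacted", ylab = "% of target")
abline(0, 1, lty = 2)

# Profit chart: revenue per hit minus fixed and per-individual costs
fixed_cost <- 500; ind_cost <- 2; revenue <- 15
profit <- hits * revenue - fixed_cost - seq_along(hits) * ind_cost
plot(pop, profit, type = "l", xlab = "% contacted", ylab = "profit")
```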
Classification Matrix The matrix has actual values in rows (negative, positive) and predicted values in columns (negative, positive): A = actual negative predicted negative, B = actual negative predicted positive, C = actual positive predicted negative, D = actual positive predicted positive.
The accuracy (AC) is the proportion of the total number of predictions that were correct: AC = (A + D) / (A + B + C + D).
The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified: TP = D / (C + D).
The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive: FP = B / (A + B).
The true negative rate (TN) is the proportion of negative cases that were classified correctly: TN = A / (A + B).
The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative: FN = C / (C + D).
The precision (P) is the proportion of the predicted positive cases that were correct: P = D / (B + D)
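The same measures computed in R from illustrative counts, using the slide's A-D labels:

```r
A <- 50; B <- 10   # actual negative: predicted negative / positive
C <- 5;  D <- 35   # actual positive: predicted negative / positive

accuracy  <- (A + D) / (A + B + C + D)   # correct predictions overall
recall    <- D / (C + D)                 # true positive rate
fp_rate   <- B / (A + B)                 # false positive rate
tn_rate   <- A / (A + B)                 # true negative rate
fn_rate   <- C / (C + D)                 # false negative rate
precision <- D / (B + D)                 # correct among predicted positives
```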
Cross Validation Cross validation shows the robustness of the models. It splits the training set into folds, uses one fold for testing and the others for training, and repeats this for each fold. You can see how the models perform over different subsets of the data
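A minimal k-fold cross-validation sketch in R with the rpart package; iris and k = 5 are illustrative:

```r
library(rpart)
set.seed(7)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))     # assign cases to folds

acc <- sapply(1:k, function(i) {
  fit  <- rpart(Species ~ ., data = iris[folds != i, ], method = "class")
  pred <- predict(fit, iris[folds == i, ], type = "class")
  mean(pred == iris$Species[folds == i])               # accuracy on held-out fold
})
acc        # performance over the different subsets
mean(acc)  # overall estimate of robustness
```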
Questions? Thank you! Join the conversation on Twitter: @DevWeek #DW2015