Working with Multidimensional Cubes in SQL Server Data Mining


Working with Multidimensional Cubes in SQL Server Data Mining
MIS 5346 Foundations of Data Warehousing
G. Green
Student Notes 4/15/ :31 AM Page 1 of 38

Data Mining
Example: Increasing Student Support Decision Tree, First Attempt
    Problem Definition
    Data Preparation
    Model Development/Training
Example: Increase Student Support Decision Tree
    Problem Definition
    Data Preparation
    Model Development/Training
    Model Validation/Evaluation
        Prepare Test Data
        Lift Chart with No Predict Value to Validate Model
        Lift Chart with Predict Value to Validate Model
        Classification Matrix
    Model Deployment/Use
        Singleton Query
        Prediction Join Query
Example: Increase Student Support Clustering
    Problem Definition
    Data Preparation
    Model Development/Training
    Model Validation/Evaluation
    Model Deployment/Use
Data Mining with Excel
Example: Course Mixture -- Association
    Data Preparation
        Configure a connection to SSAS
        Import data
        Explore Data
        Clean/Transform Data
    Model Development/Training
    Model Validation/Evaluation
    Model Deployment/Use

Data Mining

Data mining is a set of techniques for exploring large amounts of data to find patterns and make predictions. We use the results to help improve the organization's performance.

Data warehouses can assist in decision-making to improve organizational performance. For example, an OLTP system or data warehouse can easily tell us "How many of productX were sold last month?" An OLAP cube can easily tell us "What is the difference in sales of productX over the last 5 years by region?" However, a question such as "Which consumers should be targeted for future sales of productX?" is more easily addressed by data mining.

Other examples of questions that data mining can help answer:
- Why did a certain political candidate win or lose an election?
- Will this particular customer buy a product or not?
- Which of my loyalty card members are most likely to buy productX?
- What will the utilization of my hospital beds be over the next month?
- Based on my sales, how do I need to staff?

Data mining and data warehouses complement each other well. Data warehouses provide historical data that has been integrated and cleansed; data mining helps identify which data is more meaningful for decision-making and may therefore warrant further attention in the warehouse.

Six Broad Categories of Data Mining Tasks and Related Algorithms

1. Classification
   Prediction of a discrete attribute (an attribute with distinct values like yes/no, good/bad, high/medium/low, likely/unlikely, ...) based on the other column/attribute values in the case.
   Algorithms: Decision Tree, Naïve Bayes, Neural Net
2. Estimation/Regression/Forecasting
   Prediction of a continuous attribute (like sales, probability, ...) based on the other column/attribute values in the case.
   Algorithms: Linear Regression, Logistic Regression, Time Series
3. Association
   AKA market basket analysis; finds which cases belong together in a group; requires groupings of data from the past.
   Algorithms: Association
4. Segmentation/Clustering
   AKA market segmentation; groups data into categories based on shared/similar attribute values.
   Algorithms: Clustering
5. Sequence Analysis
   Examines the ordering of events over time to predict future sequencing. E.g., what products or services will a customer need next? How should our TV programs be sequenced?
   Algorithms: Clustering
6. Deviation Analysis
   AKA fraud detection; finds exceptions or outliers in the data.
   Algorithms: Clustering

We will focus on three algorithms:

1. Decision Tree
   a. Tree-like model of decisions and consequences/outcomes
   b. Good for predicting
   c. Non-parametric, so no specific data distribution is needed/assumed
   d. Supervised, so you specify:
      i. A key column unique to each case/row
      ii. One or more input variables
      iii. The variable you're trying to predict (the target)
   e. Can handle unbalanced datasets (i.e., datasets with large numbers of positive or negative targets, and small numbers of the opposite target type)
   f. Very simple to understand
2. Clustering
   a. Groups similar objects together
   b. Good for exploring data and gaining insights into cluster characteristics
   c. Unsupervised, so no target variable is needed. But we do need:
      i. A key column that is unique to each record
      ii. One or more input columns that are used to form the clusters
3. Association
   a. Discovers relationships/regularities in the data
   b. Generates association rules (i.e., "if (antecedent) then (consequent)" statements)
   c. Rule strength is indicated by Support (how frequently the items appear together) and Confidence (how often the rule holds true when the antecedent is present)
   d. The best rules can be used for marketing campaigns
   e. Need:
      i. Granular data
      ii. A key column that is unique to each transaction/itemset/case (e.g., each basket)
      iii. A predictable/target variable that is typically the key of the items grouped in an itemset (e.g., each item in a basket)
      iv. One or more ID/input columns that have discrete values

The Association algorithm requires source data to have:
- A key column that uniquely identifies an itemset; it cannot be a concatenated key
- A column that serves as a predictable column (typically the key of a nested table) <=item#>
- One or more input columns that have discrete values

NOTE: good website for questions to ask before starting analytics project: questions-first

Steps for Data Mining

1. Problem Definition
2. Data Preparation
3. Model Development/Training
4. Model Validation/Evaluation
5. Model Deployment/Use
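To make the Support and Confidence measures for association rules concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how SSAS computes its figures: the course codes and baskets are made up, support is expressed as a fraction of all baskets (a tool may report it as a raw count instead), and the function names are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = frozenset(items)
    hits = sum(1 for basket in baskets if items <= basket)
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent (0.0 if the antecedent never appears)."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for basket in baskets if antecedent <= basket)
    if n_antecedent == 0:
        return 0.0
    n_both = sum(1 for basket in baskets if both <= basket)
    return n_both / n_antecedent


# Hypothetical itemsets: the courses each student took together.
baskets = [
    frozenset({"MIS101", "ACC201"}),
    frozenset({"MIS101", "ACC201", "FIN301"}),
    frozenset({"MIS101"}),
    frozenset({"ACC201", "FIN301"}),
]

# Rule: if MIS101 then ACC201
rule_support = support(baskets, {"MIS101", "ACC201"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"MIS101"}, {"ACC201"})  # 2 of 3 baskets
```

A rule with high confidence but very low support holds for only a handful of cases, so both numbers are usually read together.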

A good, short tutorial for Microsoft SSAS Data Mining (includes link to sample file):

Example: Increasing Student Support Decision Tree, First Attempt

Problem Definition

Here we identify our business goal, followed by careful consideration of opportunities for data mining to assist in achieving the goal. As Kimball puts it:

"the overall business value goal [should be described] in as narrow and measurable way as possible ... A goal like 'reduce the monthly churn rate' is a bit more manageable. Next, think about what factors influence the goal. What might indicate that someone is likely to churn? How can we tell if someone would be interested in a given product? ... try to translate them into specific attributes and behaviors that are known to exist in a usable, accessible form ... the data miner should work with the business folks to prioritize the various opportunities based on the estimated potential for business impact and the difficulty of implementation."
The Microsoft Data Warehouse Toolkit, by Mundy, Thornthwaite, and Kimball <pg 442>

We have the following business goals:
- Increase academic success by students
  o Identify potentially at-risk students so we can take proactive measures to increase their likelihood of academic success.
  o We will use classification with a decision tree algorithm

Data Preparation

The case is the basic unit of analysis in data mining. The objective in data preparation is to build data mining case sets that can be effectively used by the data mining algorithms.

A case set is a dataset that includes one row per instance, event, or customer. All the information about the instance/event/customer is included in one record. A common example of a case set is a set of customer records containing demographic-type data. However, for many events like purchases, the case set may include one row for each product purchased by a customer. This is called a nested case set because it has two components for each case: one row representing a customer with customer attributes, and multiple product rows that represent the products purchased by the customer.

Data preparation typically involves creating at least two case sets: one to be used for training our data mining model, and another to be used for subsequent testing of our model. We could split our source dataset into two datasets for this purpose. However, we will use a single dataset. Then, when training our model, we will tell SSAS to set aside a percentage of cases for subsequent testing.

Another issue in data preparation involves the identification of attributes needed in the case set. Not all attributes in our fact or dimension tables will be deemed useful for data mining. There are many published approaches to eliminating attributes from case sets, many of them involving statistical measures such as degree of multicollinearity, high significance (P) values, etc. Other potential eliminators include a high percentage of missing values, single-valued fields, fields with personally-identifying information, etc. However, from a practical standpoint, don't forget the importance of judging attributes based on importance to the business!
When identifying attributes, we may also find that we need one or more attributes in our case set that do not currently exist in the fact or dimension tables (e.g., the dependent variables being predicted). In these cases we will need to add these attributes to our case set.

Rather than use the base data mart dimensions and facts as our case set, we will instead flatten the data from these sources for use in data mining. Basically, flattening involves going into the source database or data warehouse and combining data from different tables into a single table using inner and/or outer joins. This flattened dataset is then used as the case set for data mining.
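The idea of setting aside a percentage of cases for testing can be sketched in a few lines of Python. This is only an illustration of a random holdout split under assumed names (holdout_split, the seed value); SSAS performs its own partitioning internally when you set the testing percentage in the wizard.

```python
import random


def holdout_split(cases, test_fraction=0.3, seed=42):
    """Randomly partition cases into (training, testing) lists.

    The fixed seed is an assumption made so the split is repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 100 hypothetical case keys, with 25% held out for testing.
cases = list(range(100))
training, testing = holdout_split(cases, test_fraction=0.25)
```

Every case lands in exactly one of the two sets, so the model is later validated on cases it never saw during training.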

To analyze student performance in courses, we create a flattened dataset by creating the view below:

    -- One row per student, with the average course grade and a grade category
    CREATE VIEW view_student_performance AS
    SELECT s.[student_sk],
           [city],
           [state_abbreviation],
           [major],
           [classification],
           [gmat],
           [gpa] AS HighSchoolGPA,
           AVG([coursegrade]) AS AverageCourseGrade,
           CASE
               WHEN AVG([coursegrade]) >= 3.3 THEN 'high'
               WHEN AVG([coursegrade]) >= 2.8 THEN 'medium'
               ELSE 'low'
           END AS GradeCategory
    FROM factenrollment e
         INNER JOIN dimstudent s ON s.student_sk = e.student_sk
         INNER JOIN dimlocation l ON l.location_sk = e.location_sk
    GROUP BY s.[student_sk], [city], [state_abbreviation], [major],
             [classification], [gmat], [gpa];

Run the above statement in SSMS. View the data. Note there is one row per student.
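The GradeCategory logic in the view can be sketched outside SQL as follows. This Python version mirrors the CASE thresholds (3.3 and 2.8) from the view; the function names and the sample enrollment rows are assumptions made for illustration only.

```python
from collections import defaultdict


def grade_category(avg_grade):
    """Mirror the view's CASE expression: high / medium / low."""
    if avg_grade >= 3.3:
        return "high"
    if avg_grade >= 2.8:
        return "medium"
    return "low"


def categorize_students(enrollments):
    """Average each student's course grades, then bucket the average.

    `enrollments` is an iterable of (student_sk, course_grade) pairs.
    """
    grades = defaultdict(list)
    for student_sk, course_grade in enrollments:
        grades[student_sk].append(course_grade)
    return {sk: grade_category(sum(g) / len(g)) for sk, g in grades.items()}


# Hypothetical enrollment facts: (student_sk, course grade on a 4.0 scale)
enrollments = [(1, 4.0), (1, 3.0), (2, 2.5), (2, 3.0), (3, 3.0)]
categories = categorize_students(enrollments)
```

Because the first branch already handles averages of 3.3 and above, the second branch only needs the lower bound of 2.8.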

Model Development/Training

Model development involves the creation of the data mining structures that will be used to shed insight on the business questions identified previously. The structures that need to be created include:
- An Analysis Services project
- Data Source
- Data Source View
- Mining Structure
- Mining Model(s)

1. Create an Analysis Services project in SSDT
   a. We will begin by using our existing ClassPerformanceAS project
2. Create a data source
   a. We will use the existing data source (ClassPerformanceDWDS) that points to our ClassPerformanceDW data mart in SQL Server
3. Create a data source view (DSV)
   a. We will modify the existing DSV
      i. Right-click in the DSV display area and add/remove tables
      ii. Select the new view(s) we created during data preparation
      iii. Save the DSV

Next we create mining models to help achieve our two business goals. We also supply data to our model algorithms to train the algorithm on a particular aspect of the organization. We start with a decision-tree based model for predicting performance.

4. In Solution Explorer, right-click the Mining Structures subfolder; click New Mining Structure
5. Choose "From existing relational database or data warehouse"; Next
6. Choose to "Create mining structure with a mining model" and select the Microsoft Decision Trees algorithm; Next
7. Select the ClassPerformanceDW DSV; Next
8. Check the Case box next to the view view_student_performance; Next
9. On the Specify the Training Data dialog page,
   a. Check the Key box next to student_sk if not already selected
   b. Check the Predictable box next to GradeCategory. This action identifies that column as the column to be predicted.
   c. Click the Suggest button. This allows SSAS to recommend additional input columns based on whether they correlate with the Predictable variable at higher than 0.05; click OK.
   d. Remove AverageCourseGrade as a model column by unchecking the box to the left of the column name
   e. Ensure city, classification, gmat, gpa, major, and state are selected as input columns
   f. Click Next

10. On the Specify Columns' Content and Data Type dialog page,
    a. Ensure city, classification, GradeCategory, major, and state have Content Types of Discrete. You can click the dropdown and change the types manually OR you can click the Detect button.
11. Next we set aside a portion of data that will be used for validating our model
    a. Change the Percentage of data for testing to 25%; Next
12. On the Completing the Wizard dialog page,
    a. Name the mining structure Predict Successful Students
    b. Name the mining model Predict Successful Students Decision Tree
    c. Check the Allow drill through box
    d. Click Finish
13. Redeploy the project

The next appropriate step is to create alternate mining models within this same mining structure that predict the same target (here, GradeCategory) using algorithms such as Naïve Bayes and/or neural net. This is sometimes referred to as triangulation. Ideally the different algorithms would provide similar results. By trying several algorithms we (1) have the ability to compare models to determine the best predictor of the target, and (2) provide more confidence to our users that our predictions are valid if their outcomes are consistent. We will skip triangulation and proceed to explore the results of our decision tree model.

We can view the resulting decision tree by going to the Mining Model Viewer. There are two ways to view a decision tree model: the Decision Tree view and the Dependency Network view. For additional information, refer to the following Microsoft website:

Decision Tree View

The decision tree view shows us the decision steps used to predict the target variable.

1. Click on the Mining Model Viewer tab.
2. Click on the Decision Tree tab. As a result of running a decision tree analysis, one tree is created for each predictable column. Since we identified only one predictable column, there is only one tree to view in this tab.

The resulting tree should have multiple levels; however, we have only one level. One issue is that our sample size is far too small; Analysis Services will stop splitting nodes, or fail to split a node, if the number of cases is too small. Another issue is that because our class performance data is fictitious with values randomly assigned, Analysis Services likely could not find an input variable that predicted/differentiated successful students better than the other input variables. So we will use another dataset to demonstrate decision tree analysis.

Example: Increase Student Support Decision Tree

Problem Definition

We have the following business goals:
- Anticipate financial needs of students
  o Predict whether a potential student will require financial aid
    We will use classification with a decision tree algorithm
- Increase academic success by students
  o Determine student profiles so that we can better serve students with common characteristics
    We will use a clustering algorithm

Data Preparation

To predict student need for financial aid, we create a flattened dataset by creating the view below:

    -- One row per student: academic measures plus student attributes
    CREATE VIEW StudentInfo AS
    SELECT s.student_ak,
           f.act,
           f.sat,
           f.high_school_gpa,
           f.highschoolrank,
           s.first_name,
           s.last_name,
           s.birth_date,
           s.marital_status,
           s.gender,
           s.[full_time/part_time],
           s.legacy_status,
           s.transfer_flag,
           s.state,
           s.zip_code,
           s.country,
           s.financial_aid
    FROM dbo.dim_student AS s
         INNER JOIN dbo.fact_academic AS f ON s.student_sk = f.student_sk
    WHERE s.student_sk <> -1;

Run the above statement in SSMS. View the data. Note there is one row per student.

Model Development/Training

1. Create an Analysis Services project in SSDT
   a. We will begin by creating a new HigherEdDM Analysis Services project/solution (store on desktop)
2. Create a data source
   a. Create a data source (HigherEdDW DS) that points to our existing HigherEdDW data mart in SQL Server
3. Create a data source view (DSV)
   a. Create a DSV (HigherEdDW DSV) that contains only the new view(s) we created during data preparation
4. In Solution Explorer, right-click the Mining Structures subfolder; click New Mining Structure
5. Choose "From existing relational database or data warehouse"; Next
6. Choose to "Create mining structure with a mining model" and select the Microsoft Decision Trees algorithm; Next
7. Select the HigherEdDW DSV; Next
8. Check the Case box next to the view StudentInfo; Next
9. On the Specify the Training Data dialog page,
   a. Check the Key box next to student_ak if not already selected
   b. Check the Predictable box next to Financial_Aid. This action identifies that column as the column to be predicted.
   c. Click the Suggest button. This allows SSAS to recommend additional input columns based on whether they correlate with the Predictable variable at higher than 0.05. Click the input cells for columns that are visible (not grayed-out); click OK
   d. Choose HighSchoolRank and SAT as additional input columns by checking the Input boxes to the right of the column names
   e. Click Next
10. On the Specify Columns' Content and Data Type dialog page,
    a. Click the Detect button
    b. Click Next
11. Next we set aside a portion of data that will be used for validating our model
    a. Keep the Percentage of data for testing at 30%; Next
12. On the Completing the Wizard dialog page,
    a. Name the mining structure Predict Financial Aid
    b. Name the mining model Predict Financial Aid Decision Tree
    c. Check the Allow drill through box
    d. Click Finish
13. Deploy the project (be sure that the OLAP login has the datareader and datawriter roles mapped in SSMS)

Decision Tree View

The decision tree view shows us the decision steps used to predict the target variable.

1. Go to the Mining Model Viewer tab
2. Click on the Decision Tree tab. As a result of running a decision tree analysis, one tree is created for each predictable column. Since we identified only one predictable column, there is only one tree to view in this tab.

If there are multiple levels in the tree, we can expand or reduce the number of tree levels shown by moving the Show Level slider appropriately.

A quick look at the tree tells us that a student's ACT score appears to be the primary predictor of whether or not the student will require financial aid. After accounting for the ACT score, whether or not the student is a transfer student appears to be the next most important predictor.

The mining legend area gives us guidance on how to interpret the node and histogram bar colors.

Each node represents a subset of cases in the decision tree. By default, the darkest nodes are the nodes with the largest number of cases. This default can be changed by setting the Background field (see description below).

Hovering over a node provides data about the cases in that node, including the condition required to reach that node from the node preceding it. For example, hovering over the leaf node (i.e., a lowest-level node) representing ACT >= 19 and Transfer Flag = 1, you can see that 2,264 students meeting those criteria required financial aid, while 906 students meeting the same criteria did not.
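As a rough guide to reading a node, the case counts can be turned into proportions. The Python sketch below uses the two counts quoted above for the ACT >= 19, Transfer Flag = 1 leaf; the function name and dictionary keys are assumptions, and the probabilities SSAS displays for a node may differ slightly from these raw ratios.

```python
def node_distribution(counts):
    """Convert a node's case counts per outcome into proportions."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


# Case counts from the leaf node for ACT >= 19 and Transfer Flag = 1
leaf_counts = {"financial_aid": 2264, "no_financial_aid": 906}

distribution = node_distribution(leaf_counts)
# The node's prediction is simply its most common outcome.
predicted_outcome = max(distribution, key=distribution.get)
```

Here roughly 71% of the 3,170 cases in the leaf required financial aid, so a new student who reaches this leaf would be predicted to need aid.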

By examining the histogram bars within the nodes you can see the approximate ratio of non-financial aid students to financial aid students. According to the mining legend, the blue histogram bars indicate students who received financial aid (predictable value = 1); the red bars indicate students who did not receive financial aid (predictable value = 0). A visual examination of these bars tells us, for example, that all the students with ACT scores under 11 required financial aid, whether they were transfer students or not. Similarly, the majority of students with ACT scores between 11 and 15 did NOT require financial aid, regardless of their transfer status.

The Background field controls which predictable column cases are being highlighted:
- yes or true cases (the tree shows the darkest color where there are more cases of students receiving financial aid)
- no or false cases (the tree shows the darkest color where there are more cases of students NOT receiving financial aid)
- cases missing prediction values
- All cases (the tree shows the darkest color where there are more total cases)

Dependency Network View

The Dependency Network tab shows the relationships between the attributes that contributed to the predictive ability of the mining model.

1. Click on the Dependency Network tab

By default, all the predictive attributes are shown. However, by moving the slider on the left side, you can see which attributes are the strongest predictors; in this case, ACT scores. This can also be seen on the previous Decision Tree tab, as ACT score was the variable used in the first split of the tree.

Model Validation/Evaluation

After reviewing the results of the mining algorithm in the Mining Model Viewer, we validate the model. Validating the model involves checking the accuracy of the results produced by the algorithm, and comparing the predictive ability of multiple models/algorithms.

From the Mining Accuracy Chart tab, there are three tools for model validation:
- Lift Chart
- Classification Matrix
- Cross Validation tool

Model validation begins with using the test data that we set aside previously as input to our mining model(s). The model then predicts the predictable attribute in this test data set. We then validate the model by comparing the model's predictive performance against the known outcomes.

Prepare Test Data

1. Click on the Mining Accuracy Chart tab
2. Click on the Input Selection tab
3. Ensure Synchronize Prediction Columns and Values is checked, and that all mining models are selected
4. Ensure the Use mining model test cases radio button is selected at the bottom

Lift Chart with No Predict Value to Validate Model

5. Click on the Lift Chart tab
   a. The blue line represents the ideal model
   b. The red line represents the predictive ability of our model
   c. The dark gray vertical line indicates what percentage of the sample population the Mining Legend is displaying statistics for

This version of the lift chart does not contain a Target Value; i.e., we did not specify on the Input Selection tab whether we wanted to see the accuracy of yes predictions vs. no predictions, so this lift chart shows ALL predictions. We can see that when 50% of the population is processed, our model has correctly classified 37.2% of the population (as financial aid or non-financial aid students), whereas the ideal model (blue line) would have correctly classified all 50% of the population processed. If we had used multiple models, additional lines would be included on the chart showing the predictive ability of each model for comparison.

The Predict Probability is similar to a confidence level; it tells us that our decision tree model will correctly predict 37.2% of the current population IF we rely on the results that have a 62.49% predict probability (i.e., confidence) or higher.

Lift Chart with Predict Value to Validate Model

6. Click on the Input Selection tab
7. Change Predict Value to Yes (or 1)
8. Return to the Lift Chart tab
   a. The blue line now represents random guessing
   b. The green line represents the ideal model
   c. The red line still represents our model

Now we see that when 50% of the population is processed, our model correctly predicts approximately 63% of the students who will receive financial aid. This is better than random guessing. Many say that this type of lift chart is more valuable to analyze.

Classification Matrix

The classification matrix allows us to compare predicted versus actual values.

9. Click on the Classification Matrix tab

The numbers circled in green represent cases where the model correctly predicted financial aid students (i.e., true positives); the numbers circled in red represent incorrectly predicted financial aid students (i.e., false positives). Using the test data, our model resulted in 3824 correct predictions and 2176 incorrect predictions. Note that if we had used multiple models, we could have compared the accuracy of the models using this matrix.
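A classification matrix is just a tally of (predicted, actual) pairs, which the following Python sketch illustrates. The function names and the six sample predictions are assumptions; the last line applies the same accuracy arithmetic to the correct/incorrect totals reported above.

```python
from collections import Counter


def classification_matrix(predicted, actual):
    """Count how often each (predicted, actual) pair occurs."""
    return Counter(zip(predicted, actual))


def accuracy(matrix):
    """Fraction of cases on the matrix diagonal (predicted == actual)."""
    total = sum(matrix.values())
    correct = sum(n for (pred, act), n in matrix.items() if pred == act)
    return correct / total if total else 0.0


# Hypothetical test cases: 1 = financial aid, 0 = no financial aid
predicted = [1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 1, 1, 0]

matrix = classification_matrix(predicted, actual)
true_positives = matrix[(1, 1)]
false_positives = matrix[(1, 0)]
sample_accuracy = accuracy(matrix)

# Overall accuracy implied by the totals quoted in these notes
notes_accuracy = 3824 / (3824 + 2176)
```

With 3824 correct out of 6000 test cases, the decision tree model is right about 64% of the time.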

Model Deployment/Use

At this point we have already developed, compared, selected, and deployed a data mining model ready to support real-world prediction and decision-making. We can preview the ability for users and/or client applications to issue DMX queries against our model to retrieve results for a specific case or for a batch of cases. Issuing DMX queries against a previously-created mining model is how client applications can use our mining model to gain real-time insights and/or predictions. We can preview this functionality in the Mining Model Prediction tab of SSDT.

Two types of DMX prediction queries are supported via SSDT: a single-row query (AKA Singleton query), and a batch query (AKA Prediction Join query). We will look at examples of both.

In addition, two ways to create your DMX query are supported via SSDT: using a GUI that generates the DMX for you, or entering the DMX manually. We will look at examples of both.

Singleton Query

Using a singleton query, we can feed a single case as input to a mining model in order to retrieve the predicted value for that case. This is how an application could use the mining model in real time.

1. With the mining model open, click on the Mining Model Prediction tab, and click on the Singleton Query button

2. Click the Select Model button
3. Highlight the Predict Financial Aid Decision Tree model and click OK

In the Singleton Query Input window, you can enter the specific input values for the case you want the Decision Tree model to predict. Then, in the bottom area of the Mining Model Prediction window, you can specify which values you want the DMX query to return. At a minimum, you'd want the query to return the predictable attribute; in our case, the Financial Aid attribute. In addition, we will ask the model to return the probability that the predicted attribute is true, as well as the probability that the predicted attribute is false.

4. Enter values as shown below, then click on the View dropdown and select Results:

Results of our prediction query are shown below. We aliased the two expressions so that they appear in the result set with user-friendly names (similar to aliasing columns in SQL).

5. Click on the Query view in order to see the DMX query that SSDT generated to produce our results.
6. Copy/paste the query into SSMS and run it from there as well. Be sure to open a DMX query window. Note that we can modify the DMX code to do the aliasing if we didn't do it in the SSDT version of the DMX query:

The previous example illustrates how a user or application can dynamically retrieve a prediction result for a single case. However, our user/application needs might require us to calculate values for a group of cases; in this case we need to create a Prediction Join query.

Prediction Join Query

A prediction join query, or batch query, allows us to feed multiple cases to a mining model and retrieve the resulting predictions for each of the cases. The cases need to be contained in a table. Ideally we would have a new set of data to use as input in a prediction join query; however, for the purpose of our class example, we will use the same set of data that we used to create the model.

1. With the mining model still open in the Mining Model Prediction tab, click again on the Singleton Query button to toggle back to Prediction Query mode (you do not need to save the previous query).
2. Click Select Case Table, choose the StudentInfo view, then click OK

Notice that SSDT automatically tries to map fields used in the mining model to fields in the input table. However, we do not want it to use the Financial_Aid field in the input table, because this is the field we want the model to predict. Therefore, we need to remove the mapping for this field.

3. Click on the mapping line going from Financial Aid in the mining model to Financial_Aid in the input table; right-click on the line and Delete it.

4. Once again, we now choose the data we want to appear in the results of our mining model. Set the fields as indicated below:
5. Click the dropdown to view the Result of running the query.

We can save these results as a relational table so that users/applications will have access to them.

6. Click on the Save button on the Mining Model Prediction tab. We can save the results as a new table in the database of our data source.

Example: Increase Student Support Clustering

Problem Definition

We have the following business goals at BT University:
- Anticipate financial needs of students
  o Predict whether a potential student will require financial aid
    We will use classification with a decision tree algorithm
- Increase academic success by students
  o Determine student profiles so that we can better serve students with common characteristics
    We will use a clustering algorithm

Data Preparation

To predict if a student will graduate in fewer than 4 years, we create a flattened dataset as follows: <20,000 recs>

    CREATE VIEW StudentInfoDetailed AS
    SELECT s.student_ak,
           f.act,
           f.sat,
           f.high_school_gpa,
           f.highschoolrank,
           s.first_name,
           s.last_name,
           s.birth_date,
           s.marital_status,
           s.gender,
           s.race,
           s.[full_time/part_time],
           s.legacy_status,
           s.transfer_flag,
           s.state,
           s.zip_code,
           s.country,
           s.financial_aid,
           -- A major name of '-1' means no major is recorded in that slot
           CASE
               WHEN Major_1_name <> '-1' AND Major_2_name <> '-1' THEN 2
               WHEN Major_1_name <> '-1' AND Major_2_name = '-1' THEN 1
               WHEN Major_1_name = '-1' AND Major_2_name <> '-1' THEN 1
               ELSE 0
           END AS num_majors,
           -- Years from start to graduation (the original subtracted these
           -- in the reverse order, which made every student an early graduate)
           CASE
               WHEN YEAR(graduation_date) - YEAR(start_date) < 4 THEN 1
               WHEN YEAR(graduation_date) - YEAR(start_date) >= 4 THEN 0
           END AS early_graduate,
           YEAR(s.Start_Date) - YEAR(s.Birth_Date) AS age
    FROM dbo.dim_student AS s
         INNER JOIN dbo.fact_academic AS f ON s.student_sk = f.student_sk
    WHERE s.student_sk <> -1;

Run the above statement in SSMS. View the data. Note there is one row per student.
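The derived early_graduate and age columns can be sanity-checked with a small Python sketch. It assumes the intent is "years from start to graduation is fewer than 4" and, like the view, works on calendar years only; the function names and sample dates are hypothetical.

```python
from datetime import date
from typing import Optional


def early_graduate(start: date, graduation: Optional[date]) -> Optional[int]:
    """1 if the student graduated fewer than 4 calendar years after starting,
    0 otherwise, and None when no graduation date is recorded."""
    if graduation is None:
        return None
    return 1 if graduation.year - start.year < 4 else 0


def age_at_start(birth: date, start: date) -> int:
    """Approximate age at enrollment, using calendar years only."""
    return start.year - birth.year


# Hypothetical students
three_year = early_graduate(date(2010, 8, 20), date(2013, 5, 15))
four_year = early_graduate(date(2010, 8, 20), date(2014, 5, 15))
not_graduated = early_graduate(date(2010, 8, 20), None)
age = age_at_start(date(1992, 3, 1), date(2010, 8, 20))
```

Subtracting in the opposite order (start year minus graduation year) always gives a negative number, which is less than 4 for every graduate, so the order of subtraction matters.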

24 Clustering attempts to group similar records together and does not need a target variable specified in advance. It is more exploratory in nature and is sometimes done before other, more predictive techniques in order to better understand the data. We will use the same analysis services project (HigherEdDM) and the same data source (HigherEdDW DS). But because we are using a different view we ll need to add it to our current DSV. Model Development/Training 1. Update the existing data source view (DSV) a. Double-click the HigherEdDW DSV to open it, if it is not already open b. Right-click in an empty area of the DSV design area and choose Add/Remove Tables c. Highlight the StudentInfoDetailed view and click the right arrow to move it to Included objects; OK d. Save the DSV 2. In Solution Explorer, right-click Mining Structures subfolder; click New Mining Structure 3. Choose From existing relational database or data warehouse ; Next 4. Choose the Microsoft Clustering technique 5. Choose the Higher Ed DW DSV 6. Check the Case box next to the StudentInfoDetailed view to indicate the case set we ll be using for the analysis 7. Choose Student_AK as the Key column; with the exception of Birth_Date, First_Name, Last_Name, and Zip_Code, set all other variables as Input variables 8. Click Detect to have SSAS detect the type of data stored in the input variables 9. Choose 30% for testing 10. Name the Mining Structure Student Info Detailed; name the Mining Model Student Info Detailed Clustering; check box to Allow drill through; Finish You can guide how mining algorithms work by overriding default parameters. 1. Click the Mining Models tab 2. Right-click the clustering model and choose Set Algorithm Parameters 3. Set the CLUSTER_COUNT parameter to 0 which allows the clustering algorithm to determine the best number of clusters to create; click OK 4. Save and Redeploy the project. 5. Go to the Mining Model Viewer tab to view the resulting cluster diagram. 
There are 4 sub-tabs for viewing the results of clustering: Cluster Diagram, Cluster Profiles, Cluster Characteristics, and Cluster Discrimination.
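The 30% testing setting chosen in the mining structure wizard holds back part of the case set so that it is not used for training. As a rough illustration of the idea only (this is not SSAS's own sampling routine), the Python sketch below splits a list of case keys 70/30. The function name, the seed, and the sample keys are hypothetical.

```python
import random

def split_cases(case_keys, test_percent=30, seed=42):
    """Randomly reserve test_percent of the cases for testing."""
    keys = list(case_keys)
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    rng.shuffle(keys)
    n_test = len(keys) * test_percent // 100
    # First n_test shuffled keys are held out; the rest are used for training.
    return keys[n_test:], keys[:n_test]

student_keys = range(1, 101)           # hypothetical Student_AK values
train, test = split_cases(student_keys)
print(len(train), len(test))
```

Because the seed is fixed, rerunning the sketch yields the same split, which makes later comparisons between models easier to reason about.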

Cluster Diagram View

On the Cluster Diagram tab, we can see that 6 clusters have been created. The connecting lines tell us how closely related one cluster is to another: the darker the line, the stronger the relationship. The slider on the left can be adjusted to show anything from only the strongest links to all links. In our case, the strongest links are between clusters 1 and 5.

By default, the Shading Variable is Population. This means that the clusters with darker shades have the most cases in that group. If you want to examine the number of cases in a group that have a certain attribute value, you can change the shading variable.

1. Change the Shading Variable to State. The State box will default to Alabama since values are alphabetized; change to Texas.
2. Cluster 3 is the darkest. Hover over cluster 3; notice that 3% of the cases in that cluster have State values of Texas.
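The percentage shown when hovering over a cluster can be thought of as the share of that cluster's cases having the chosen attribute value. A minimal Python sketch of that calculation follows; the case data and the function name are made up for illustration and are not taken from the HigherEdDW data.

```python
from collections import Counter

def share_with_value(cases, value):
    """Return, per cluster, the fraction of cases whose attribute equals value."""
    totals = Counter(cluster for cluster, _ in cases)
    hits = Counter(cluster for cluster, v in cases if v == value)
    return {cluster: hits[cluster] / totals[cluster] for cluster in totals}

# Hypothetical (cluster, State) pairs
cases = [
    ("Cluster 1", "Oklahoma"), ("Cluster 1", "Oklahoma"),
    ("Cluster 3", "Texas"), ("Cluster 3", "Oklahoma"),
    ("Cluster 3", "Oklahoma"), ("Cluster 3", "Oklahoma"),
]
shares = share_with_value(cases, "Texas")
print(shares)
```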

Cluster Profiles View

A challenge when using the clustering algorithm is determining what profile each cluster represents.

1. Click on the Cluster Profiles tab.

The first profile shown is the entire population, as a reference. Also notice that discrete values are presented as bars, whereas continuous values are presented as sliders. The Histogram Bars field controls how many bars are visible in the attribute profiles. Each bar corresponds to a distinct value. If more values exist than the number of bars you display, the remaining values are grouped together into a gray bucket.

One obvious observation is that all the country values are the same. Also, Age, Gender, Legacy Status, Marital Status, Num Majors, Race, and State do not appear to vary much across clusters. We can go back to the Mining Models tab and change these fields from Input to Ignore. If you do this, redeploy the project.
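The Histogram Bars behavior described above (show the most frequent values, lump the rest into one bucket) can be sketched in a few lines of Python. The function name, the "Other" label, and the sample values are assumptions for illustration; the viewer's exact tie-breaking rules are not documented here.

```python
from collections import Counter

def histogram_buckets(values, max_bars):
    """Keep the max_bars most frequent values; group the remainder as 'Other'."""
    counts = Counter(values).most_common()      # sorted by descending frequency
    shown = counts[:max_bars]
    remainder = sum(n for _, n in counts[max_bars:])
    if remainder:
        shown.append(("Other", remainder))      # the gray bucket in the viewer
    return shown

states = ["TX"] * 5 + ["OK"] * 3 + ["LA"] * 2 + ["NM"]   # hypothetical State values
buckets = histogram_buckets(states, max_bars=2)
print(buckets)
```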

It's still pretty hard to distinguish between the 6 profiles, so we can narrow it down to a smaller number.

2. Change the CLUSTER_COUNT parameter to 4; redeploy and view the updated cluster profiles. Note the cluster diagram has changed to 4 nodes as well (population column manually hidden).

There are a few distinguishing cluster differences:

- Cluster 1 has students who are almost all on financial aid, who are transfers, and who have low ACTs.
- Cluster 4 has students who are not on financial aid, are not transfer students, and have above-average ACTs.
- Cluster 2 has students on some financial aid but not all, with high ACTs and high GPAs.
- Cluster 3 has a mix of financial aid and non-financial aid students, but with low GPAs.

We can rename the clusters to help us better identify them.

3. Right-click in the cluster1 column heading and select Rename Cluster
   a. Rename cluster1 to Financial Aid Students
   b. Rename cluster2 to High Performers
   c. Rename cluster3 to Low GPAs
   d. Rename cluster4 to Non Financial Aid Students

Cluster Characteristics View

This view allows you to examine the characteristics that make up a selected cluster. Attributes are shown in order of importance and include the probability that the attribute appears in the cluster.

1. Select the Financial Aid Students cluster in the Cluster: dropdown
2. Select different clusters and see if/how the important variables change.

Cluster Discrimination View

This view helps determine which attributes most differentiate a selected cluster from all other clusters.

1. Choose the Financial Aid Students cluster as Cluster 1; notice that the lack of high ACT scores is the primary factor distinguishing this cluster from the others; after that is Transfer status.
2. Choose the High Performers cluster as Cluster 1; notice that the primary distinguishing factor between high performers and non-high performers appears to be whether they graduate early. Change Cluster 2 to Low GPAs; when High Performers are compared to the Low GPAs cluster, GPAs and ACT scores set them apart.
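One simple way to think about "what differentiates a cluster" is to compare how often an attribute value occurs inside the cluster versus outside it. The Python sketch below does exactly that with a difference of proportions. This is an illustrative stand-in, not the scoring formula SSAS uses in the Cluster Discrimination view, and the cases are invented.

```python
def discrimination_scores(cases, cluster):
    """Score each attribute value by P(value | cluster) - P(value | other clusters)."""
    inside = [attrs for name, attrs in cases if name == cluster]
    outside = [attrs for name, attrs in cases if name != cluster]
    all_values = set().union(*(attrs for _, attrs in cases))
    scores = {}
    for value in all_values:
        p_in = sum(value in attrs for attrs in inside) / len(inside)
        p_out = sum(value in attrs for attrs in outside) / len(outside)
        scores[value] = p_in - p_out   # positive favors the cluster, negative the rest
    return scores

# Hypothetical cases: (cluster name, set of attribute=value strings)
cases = [
    ("High Performers", {"Early Graduate=Yes", "Transfer=No"}),
    ("High Performers", {"Early Graduate=Yes", "Transfer=Yes"}),
    ("Low GPAs", {"Early Graduate=No", "Transfer=No"}),
    ("Low GPAs", {"Early Graduate=No", "Transfer=Yes"}),
]
scores = discrimination_scores(cases, "High Performers")
ranked = sorted(scores, key=lambda v: abs(scores[v]), reverse=True)
print(ranked[:2])
```

In this toy data, early graduation separates the two groups completely while transfer status does not separate them at all, mirroring the kind of ranking the view presents.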

Model Validation/Evaluation

Because we are not doing any predicting, examining Model Accuracy and Model Prediction does not apply to our use of clustering.

Model Deployment/Use

To deploy and use the resulting clusters, we can use the methods discussed earlier for decision trees.

1. Go to the Mining Model Prediction tab
2. Click Select Case Table and choose the StudentInfoDetailed (dbo) table
3. Design the DMX query as shown below:
4. Click on the Results view
5. Click the disk to Save the results as a table (Segment Students) in SSMS
6. Switch to the Query view to see the DMX query that produced the model results

    SELECT
      Cluster(),
      t.[act], t.[early_graduate], t.[financial_aid], t.[transfer_flag], t.[high_school_gpa]
    FROM [Student Info Detailed Clustering]
    PREDICTION JOIN
      OPENQUERY([Higher Ed DW DS],
        'SELECT [act], [early_graduate], [Financial_Aid], [Transfer_Flag], [high_school_gpa], [sat], [highschoolrank], [Full_Time/Part_Time] FROM [dbo].[studentinfodetailed] ') AS t
    ON  [Student Info Detailed Clustering].[Act] = t.[act]
    AND [Student Info Detailed Clustering].[Sat] = t.[sat]
    AND [Student Info Detailed Clustering].[High School Gpa] = t.[high_school_gpa]
    AND [Student Info Detailed Clustering].[Highschoolrank] = t.[highschoolrank]
    AND [Student Info Detailed Clustering].[Full Time Part Time] = t.[full_time/part_time]
    AND [Student Info Detailed Clustering].[Transfer Flag] = t.[transfer_flag]
    AND [Student Info Detailed Clustering].[Financial Aid] = t.[financial_aid]
    AND [Student Info Detailed Clustering].[Early Graduate] = t.[early_graduate]

7. Rather than use the DMX query generated in SSDT, enter the DMX query below in SSMS (ensure the HigherEdDM database and the StudentInfoDetailed clustering model have been selected):

    SELECT t.*, Cluster()
    FROM [Student Info Detailed Clustering]
    NATURAL PREDICTION JOIN
      (SELECT * FROM [Student Info Detailed Clustering].CASES) AS t
    ORDER BY Cluster()

Just as with decision trees, we can also generate a singleton query to identify the cluster group that an individual case most closely fits.

8. Enter the DMX query below in SSMS (it was generated in SSDT using a singleton query):

    SELECT Cluster()
    FROM [Student Info Detailed Clustering]
    NATURAL PREDICTION JOIN
      (SELECT 30 AS [ACT], 0 AS [Transfer Flag], 200 AS [High School Rank]) AS t
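Conceptually, a singleton cluster query asks which cluster a single case fits best, using only the attributes supplied. The Python sketch below illustrates that idea with a scaled nearest-center rule. It is a deliberate simplification and not how the Microsoft Clustering algorithm computes Cluster(); the cluster centers and attribute ranges are hypothetical values, not read from the deployed model.

```python
# Hypothetical cluster centers and attribute ranges (illustration only).
CENTERS = {
    "Financial Aid Students": {"ACT": 18.0, "Transfer Flag": 1.0},
    "High Performers": {"ACT": 31.0, "Transfer Flag": 0.0},
    "Non Financial Aid Students": {"ACT": 25.0, "Transfer Flag": 0.0},
}
RANGES = {"ACT": 35.0, "Transfer Flag": 1.0}

def closest_cluster(case):
    """Return the cluster whose center is nearest, using only supplied attributes."""
    def distance(center):
        # Scale each difference by the attribute's range so ACT does not dominate.
        return sum(((case[a] - center[a]) / RANGES[a]) ** 2
                   for a in case if a in center)
    return min(CENTERS, key=lambda name: distance(CENTERS[name]))

print(closest_cluster({"ACT": 30, "Transfer Flag": 0}))
```

Attributes that are not supplied simply drop out of the comparison, which is loosely analogous to a singleton query that provides values for only a few of the model's input columns.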

Data Mining with Excel

Excel can be used as a client to create and execute data mining models, in lieu of or in addition to SSAS. Excel has two Data Mining interfaces: one developer or power-user oriented, the other end-user oriented. A key difference between the two interfaces is that the developer-oriented one can generally work directly with SQL Server or SSAS data, while the end-user oriented interface requires that a local copy of the data be stored in Excel. For this example we will use the user-oriented interface.

Example: Course Mixture -- Association

Data Preparation

The Association algorithm analyzes groups of related items and predicts the likelihood of items occurring together. The Association algorithm is used frequently in retail as Market Basket Analysis, where an item would be analogous to an individual product in a customer's shopping basket, and an itemset would represent a combination of items purchased together in shopping basket transactions. This application of association is used to gain insights about customer purchases, including which products are purchased together for potential cross-selling, which products benefit from promotions, etc.

The association algorithm produces association rules about how items are related, e.g., product X is ordered with product Y with Z degree of statistical confidence. Rules can include recommendations; recommendations are rules that exceed a certain probability threshold that you can specify.

The Association algorithm requires source data to have:

- A Key column that uniquely identifies an itemset; cannot be a concatenated key
- A column that serves as a predictable column (typically the key of a nested table)
- One or more input columns that have discrete values

Source data for the association algorithm is often in a flattened dataset created from a table containing transactions. We will create a flattened dataset based on student course enrollments in the ClassPerformanceDW database.
Each enrollment is identified by a combination of student_sk and class_sk. However, the association algorithm requires a key that identifies the market basket transaction. For our example, one student's set of enrolled classes represents a market basket. Therefore we have to identify (or create, if it doesn't exist) the atomic key that identifies the market basket transaction when we set up the association analysis. Note that the Excel Association mining algorithm cannot accept a concatenated key.

1. Execute the SQL command below in the ClassPerformanceDW database in SSMS:

    create view view_enrollments as
    select top 100 percent
           [student_sk] as transactionid,
           fe.[class_sk],
           [coursename],
           [ActualDate]
    from factenrollment fe, dimclass c, dimtime t
    where fe.class_sk = c.class_sk
      and fe.date_sk = t.datesk
    order by transactionid;
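The flattened rows returned by such a view (one row per student/course pair) are what the association analysis groups into baskets. As a small illustration of that grouping step, the Python sketch below collects course names by transaction id. The rows and the function name are invented for the example; they are not data from ClassPerformanceDW.

```python
from collections import defaultdict

def build_baskets(rows):
    """Group (transaction_id, course) rows into one set of courses per transaction."""
    baskets = defaultdict(set)
    for transaction_id, course in rows:
        baskets[transaction_id].add(course)
    return dict(baskets)

# Hypothetical flattened enrollment rows: (transactionid, coursename)
rows = [
    (1, "Database"), (1, "Networking"),
    (2, "Database"), (2, "Statistics"),
    (1, "Statistics"),                  # rows need not arrive grouped
]
baskets = build_baskets(rows)
print(baskets)
```

Grouping by key in this way does not depend on the rows arriving in sorted order, although tools that scan rows sequentially may expect each transaction's rows to be adjacent.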

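Note that TOP 100 PERCENT is used in the view_enrollments definition to circumvent the restriction on having an ORDER BY in a view. The ORDER BY is intended to group each student's registrations together. Be aware, however, that SQL Server does not guarantee the row order of a query against a view even when the view contains TOP 100 PERCENT ... ORDER BY; if the grouping matters, confirm the row order after importing or sort the data in Excel.

Configure a connection to SSAS

1. Open Excel; open a Blank Workbook
2. Click on the Data Mining tab on the ribbon
3. Click on the <No Connection> button
4. Click New and choose the Server and Catalog names as indicated below, then click OK:
5. Click Close.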

Import data

1. Go to the Data tab
2. In the External Data group, click on the From Other Sources dropdown, choose From SQL Server, then choose options as follows:
   - Set the correct database Server Name
   - Choose the ClassPerformanceDW database
   - Accept the defaults
   - Ensure cell $A$1 is selected

Explore Data

You can Explore data that will be used for mining. Exploring data in Excel gives you visual plots of the distribution of column values:

1. Click the Data Mining tab
2. Click Explore Data; Next; Next
3. On the Select Column page, select a column to explore (e.g., Coursename) by clicking on the column heading; Next
4. You should see a histogram showing the values of the column you chose and how many records have those values; click Finish

Clean/Transform Data

You can also Clean/transform data that will be used for mining. For example, you can remove outlier values and expand data values (e.g., TX becomes Texas). We'll do an example of re-labeling.

1. Click the Data Mining tab
2. Click the Clean Data dropdown and choose Re-Label; Next; Next
3. On the Select Column page, select a column to re-label (e.g., Coursename); Next
4. Specify New Labels for one course (e.g., change db to database); Next
5. Select Change data in place; Finish

Model Development/Training

Like SSAS, Excel provides wizards that help you build data mining models without having to understand the details of the algorithms that the models are built on. The Data Modeling group of the Data Mining tab shows that Excel supports the development of the following types of models:

- Classification (i.e., Decision Tree, discrete value prediction)
- Estimation (i.e., Decision Tree, continuous value prediction)
- Cluster
- Associate
- Forecast (i.e., Time Series)
- Advanced (above plus Regression, Naïve Bayes, and Neural Networks)

We'll use the Shopping Basket Analysis wizard to create an association-based model to understand which courses tend to be taken together.

1. Click anywhere in the table of data to expose the Table Tools contextual tabs
2. On the Analyze tab, click Shopping Basket Analysis
3. Set fields as indicated below, then click Advanced
4. Set minimum support and minimum rule probability as indicated below; OK, then Run

Model Validation/Evaluation

The resulting itemsets appear on the Shopping Basket Bundled Item worksheet, in descending order of the number of times the grouping of items appears in transactions. Note that itemsets that didn't meet the minimum support criterion (the combination occurs in at least 10% of sales/transactions) are not shown here. Because we had only 11 baskets/students/transactions/sales, the minimum required was Truncate(11 sales * 0.10) = Truncate(1.1) = 1 sale. So basically all our itemsets are shown.
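The minimum-support arithmetic above can be checked with a short Python sketch that also counts how often each pair of items occurs together. The function names and the three sample baskets are hypothetical; only the 11-basket, 10% calculation comes from the notes.

```python
from collections import Counter
from itertools import combinations

def minimum_support_count(n_baskets, min_support_percent):
    """Truncate(n_baskets * percent / 100), done in integer arithmetic."""
    return n_baskets * min_support_percent // 100

def frequent_itemsets(baskets, min_count, size=2):
    """Count itemsets of the given size and keep those meeting min_count."""
    counts = Counter()
    for items in baskets:
        counts.update(combinations(sorted(items), size))
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

print(minimum_support_count(11, 10))   # the case in the notes: 11 baskets at 10%

# Hypothetical baskets of course names
sample = [
    {"Database", "Networking"},
    {"Database", "Networking", "Statistics"},
    {"Statistics"},
]
pairs = frequent_itemsets(sample, min_count=2)
print(pairs)
```

With a threshold of 1, as in the notes, every itemset that occurs at all passes, which is why essentially all itemsets appear on the worksheet.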

A second worksheet, Shopping Basket Recommendations, shows the subset of itemsets that meet the minimum probability criterion; these are also known as the association rules. For our example the minimum probability criterion was 40% of baskets/students/transactions/sales. In the example below, class #2 appeared in 6 baskets/students/transactions/sales, but class #8 appeared in only 5 of those 6 sales, resulting in 5/6 = 83.33% of linked sales. The Importance represents the statistical confidence in the association rule.

Model Deployment/Use

If we had completed this Association analysis in SSAS, we could have generated DMX queries that could be issued by users and/or applications to make recommendations of products.
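The 5/6 = 83.33% figure for the class #2 and class #8 rule is a conditional share: of the baskets containing class #2, the fraction that also contain class #8. A minimal Python sketch of that calculation is below. The 11 baskets are constructed only to reproduce the counts quoted in the notes (class #3 is a filler item), and the function name is hypothetical.

```python
def rule_probability(baskets, antecedent, consequent):
    """Fraction of baskets containing antecedent that also contain consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent)

# 11 hypothetical baskets: class 2 appears in 6, and class 8 in 5 of those 6.
baskets = [{2, 8}] * 5 + [{2, 3}] + [{3}] * 5

p = rule_probability(baskets, antecedent=2, consequent=8)
print(f"{p:.2%}")        # 83.33%
print(p >= 0.40)         # rule passes a 40% minimum probability
```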