using the analytic services data mining framework for classification: predicting the enrollment of students at a university (a case study)

Data Mining is the process of knowledge discovery involving finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting mining results. Data Mining is one of the functional groups offered with Hyperion System 9 BI+ Analytic Services, a highly scalable, enterprise-class analytic (OLAP) server. The Data Mining Framework within Analytic Services integrates data mining functions with OLAP and provides users with highly flexible and extensible on-line analytical mining capabilities. On-line analytical mining greatly enhances the power of exploratory data analysis by letting users mine different subsets of data at different levels of abstraction, in combination with core analytic operations like drill up, drill down, pivoting, filtering, and slicing and dicing, all performed on the same OLAP data source.

introduction

This paper focuses on using Naïve Bayes, one of the Data Mining algorithms shipped in-the-box with Analytic Services, to develop a model that solves a typical business problem in the admissions department of an academic institution, referred to in this paper as ABC University. The paper details the approach taken to solve the problem and explains the steps performed using Analytic Services in general, and the Analytic Services Data Mining Framework in particular, to arrive at the solution.

problem statement

One of the problems typical universities face in managing admissions is predicting, with reasonable accuracy, the likelihood that an applicant will eventually enroll in an academic program. Universities typically incur considerable expense in promoting their programs and in following up with prospective candidates. Identifying applicants with a higher likelihood of enrollment helps the university channel its promotional expenditure more gainfully. Candidates typically apply to more than one university to improve their chances of enrolling within that academic year, so universities that can quickly arrive at a decision on an applicant stand a higher chance of getting acceptance from candidates. ABC University collects a variety of data from applicants as part of the admissions process: demographic, geographic, test scores, financial information, etc.

In addition, the admissions department at ABC University also has acceptance information from the previous year's admissions process. The problem at hand is to use all this available data to predict whether an applicant will choose to enroll or not. ABC University is also interested in analyzing the composite factors influencing the enrollment decision. This additional analysis is useful in adjusting the university's admissions policy and in ensuring effective cost management in the admissions department.

available data

The admissions department currently gathers demographic, geographic, test score, financial, and other information from applicants as part of the admissions process. Historical data is also available indicating the actual enrollment status of past applicants, along with all the other attributes collected as part of the admissions process. The dataset made available has 33 different attributes for each applicant, inclusive of the decision result attribute. There are in all about records available.

Table 1: List of potential mining attributes available in the database

preparing for data mining

cube is the data source

The algorithms in the Data Mining Framework are designed to work on data present within an Analytic Services cube. The design of the cube should take into consideration the data needs of all the kinds of analyses (OLAP and Data Mining) that the user is interested in performing. Once the data is brought into the cube environment it can be accessed through the Data Mining Framework for predictive analytics. The Data Mining Framework uses MDX expressions to identify sections within the cube, both to obtain input data for the algorithm and to write back the results. The Data Mining Framework can only take regular dimension members as mining attributes. This implies that only data referenced through regular dimension members (not through attribute dimensions or user-defined attributes) can be presented as input data to the Data Mining Framework. Accordingly, the data required for predictive analytics should be modeled within the standard dimensions and measures of a cube. In the case study discussed in this paper, the primary business requirement was to build a classification model for prediction. Since there were no other accompanying business requirements, the design of the Analytic Services cube was driven primarily by the Data Mining analytics need; for example, we have not used any attribute dimension modeling in the case study. In the generic case, however, it is more likely that the cube caters to both regular OLAP analytics and predictive analytics within the same dimensional model.

preparing mining attributes

The available input data can broadly be of two data types: number or string. However, since measures in Analytic Services are stored in the database in a numerical format, string type input data has to be encoded into number type data before being stored in Analytic Services. For example, if gender information is available as a string stating Male or Female, it needs to be encoded into a numeric value like 1 or 0 before being stored as a measure in the Analytic Services OLAP database. Mining attributes can be of two types: categorical or numerical. Mining attributes that describe discrete information content, like gender (Male or Female), zip code (95054, 94304, 90210, etc.), customer category (Gold, Silver, Blue), or status information (Applied, Approved, Declined, On Hold), are termed categorical attribute types. Mining attributes that describe continuous information content, like sales, revenue, or income, are termed numerical attribute types. The Analytic Services Data Mining Framework can work with algorithms that handle both categorical and numerical attribute types. Among the algorithms shipped in the box with the Analytic Services Data Mining Framework, the Naïve Bayes and Decision Tree algorithms can handle both categorical and numerical mining attribute types and treat them accordingly. One of the key steps in Data Mining is the data auditing or data conditioning phase. This involves putting together, cleansing, categorizing, normalizing, and properly encoding the data. This step is usually performed outside the Data Mining tool. The effectiveness of the Data Mining algorithm is largely dependent on the quality and completeness of the source data.
In some cases, for various mathematical reasons, the available input data may also need to be transformed before it is brought into the Data Mining environment. Transformations may sometimes also include splitting or combining input data columns. Some of these transformations may be done on the input dataset outside the Data Mining Framework, using standard data manipulation techniques available in ETL tools or RDBMS environments. For the current case the input data did not need any mathematical transformation, but some encoding was needed to convert data into a format that can be processed within the Analytic Services OLAP environment. In the current problem at ABC University, the available input data consisted of both string and number data types. The list below gives some of the input data that needed encoding from string type into number type:

- Identity-related data like Gender, City, State, Ethnicity
- Data related to the application process like Application Status, Primary Source of contact, Applicant Type, etc.
- Date-related data like Application Date, Source Date, etc. (dates were available in the original dataset as strings, in two different formats, yymmdd and mm/dd/yy, and had to be encoded into a number)

In the current case study, these encodings were done outside the Analytic Services environment by constructing look-up master tables in which the string type inputs were listed in a tabular format and the records were sequentially numbered. Subsequently, each string type input was referred to by its corresponding numeric identifier during data load into Analytic Services. Table 2 shows samples of such mapping files: one look-up table mapping State ID to State Name (VT, CA, MA, MI, NH, NJ) and another mapping AppliedStatus ID to Application Status (Applied, Offered Admission, Paid Fees, Enrolled).

Table 2: Typical mapping of numeric identifiers
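The look-up encoding described above can be sketched in a few lines of code. The snippet below is an illustration only (plain Python, with invented sample values); the case study performed this step outside Analytic Services, in the relational staging environment. It builds sequentially numbered master tables and normalizes the two date formats mentioned above into a numeric key.

```python
from datetime import datetime

def build_lookup(values):
    """Assign a sequential numeric ID to each distinct string value."""
    lookup = {}
    for value in values:
        if value not in lookup:
            lookup[value] = len(lookup) + 1
    return lookup

def encode_date(text):
    """Normalize the two source formats (yymmdd and mm/dd/yy) to a numeric yyyymmdd key."""
    fmt = "%m/%d/%y" if "/" in text else "%y%m%d"
    return int(datetime.strptime(text, fmt).strftime("%Y%m%d"))

# Invented sample values; real data would come from the admissions system.
state_ids = build_lookup(["VT", "CA", "MA", "MI", "NH", "NJ"])
status_ids = build_lookup(["Applied", "Offered Admission", "Paid Fees", "Enrolled"])

print(state_ids["MA"], status_ids["Enrolled"])          # numeric IDs used at data load
print(encode_date("040915"), encode_date("09/15/04"))   # both encode to 20040915
```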

preparing the cube

After all the input data has been identified and made ready, the next step is to design an outline and load the data into an Analytic Services cube. In the context of the current case, the Analytic Services outline was created as follows: all the input data (measures, in the OLAP context) were organized into five groups (a two-level hierarchy created in the measures dimension) based on a logical grouping of measures. The details of each measure group are explained in Table 3 below. The five measure groups are:

- Measures related to information about the applicant's identity. Some of these measures were transformed from string type to number type to facilitate modeling within the Analytic Services database context.
- Measures related to various test scores and high school examination results.
- Measures related to the context of the applicant's application processing.
- Measures related to the applicant's academic background.
- Measures providing information about the financial support and funding associated with the applicant.

Table 3: Analytic Services outline expanded

Data load is performed just as it is normally done for any Analytic Services cube. At this stage we have:

- Designed an Analytic Services cube
- Loaded it with relevant data

It should be noted that the steps described so far are generic to Analytic Services cube building and did not need any specific support from the Analytic Services Data Mining Framework.
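As a purely illustrative sketch of the two-level measures hierarchy described above, the grouping could be written down as follows. Every group and measure name here is a hypothetical placeholder except FARecieved, AppStatus, Applicant Type, StudBudget and TotalAward, which appear later in Table 4.

```python
# Hypothetical sketch of the five measure groups in the outline. Apart from
# FARecieved, AppStatus, ApplicantType, StudBudget and TotalAward (named in
# Table 4), the member names below are invented placeholders.
measure_groups = {
    "Identity":    ["Gender", "City", "State", "Ethnicity"],
    "TestScores":  ["SATScore", "HighSchoolGPA"],
    "Application": ["AppStatus", "ApplicantType", "PrimarySource", "ApplicationDate"],
    "Academic":    ["AcademicLevel", "PriorCredits"],
    "Financial":   ["FARecieved", "StudBudget", "TotalAward"],
}

# Flattening the hierarchy gives the full list of measures loaded into the cube.
all_measures = [measure for group in measure_groups.values() for measure in group]
print(len(all_measures), "measures in", len(measure_groups), "groups")
```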

identifying the optimal set of mining attributes

It is necessary to reduce the number of attributes/variables presented to an algorithm so that the information content is enhanced and the noise minimized. This is usually performed using supporting mathematical techniques to ensure that the most significant attributes are retained within the dataset presented to the algorithm. It should be noted that the choice of significant attributes is driven more by the particular data than by the problem itself. Attribute analysis, or attribute conditioning, is one of the initial steps in the Data Mining process and is currently performed outside the Data Mining Framework. The main objective of this exercise is to identify a subset of mining attributes that are highly correlated with the predicted attribute, while ensuring that the correlation within the identified subset of attributes is as low as possible. The Analytic Services platform provides a wide variety of tools and techniques that can be used in the attribute selection process. One method to identify an optimal set of attributes is to use special data reduction techniques implemented within Analytic Services through Custom Defined Functions (CDFs). Additionally, users can use other data visualization tools like Hyperion Visual Explorer to judge the effectiveness of specific attributes in contributing to the overall predictive strength of the Data Mining algorithm. Depending on the nature of the problem, users may choose the appropriate tool and technique for deciding on the optimal set of attributes. One of the advantages of working with the Analytic Services Data Mining Framework is the inherent capability in Analytic Services to support customized methods for attribute selection through Custom Defined Functions. This is essential because the process of mining attribute selection can vary significantly across problems, and an extensible toolkit comes in very handy for customizing a method to suit a specific problem. In the current case at ABC University, a CDF was used to identify the correlation effects among the available set of mining attributes. A thorough analysis of various subsets of the available mining attributes was performed to identify a subset that is highly correlated with the predicted mining attribute and at the same time has low correlation scores within itself. Since some Data Mining algorithms (like Naïve Bayes and Neural Net) are quite sensitive to inter-attribute dependencies, an attempt was made to outline the clusters of mutually dependent attributes, with a certain degree of success. From each cluster a single, most convenient, attribute was selected. For this case study an expert made the decision, but this process can be generalized to a large degree. An optimal set of five mining attributes was identified after this exercise. Table 4, shown after the sketch below, lists the identified mining attributes grouped by input attribute type, categorical or numerical.
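The CDF used in the case study is not reproduced here; the sketch below is a simplified stand-in for the same screening idea (plain numpy on invented, already-encoded data): rank candidate attributes by their correlation with the predicted attribute and keep only one representative from each cluster of mutually correlated attributes. The thresholds are arbitrary illustration values.

```python
import numpy as np

def select_attributes(data, target, names, target_corr_min=0.2, mutual_corr_max=0.6):
    """Greedy correlation screen: rank candidate attributes by their correlation
    with the predicted attribute, then keep only one representative from each
    cluster of mutually correlated attributes."""
    selected = []
    target_corr = [abs(np.corrcoef(data[:, j], target)[0, 1]) for j in range(data.shape[1])]
    for j in np.argsort(target_corr)[::-1]:                 # strongest candidates first
        if target_corr[j] < target_corr_min:
            break
        if all(abs(np.corrcoef(data[:, j], data[:, k])[0, 1]) < mutual_corr_max
               for k in selected):                          # low correlation within the subset
            selected.append(j)
    return [names[j] for j in selected]

# Invented, already-encoded applicant data: one row per applicant, one column per candidate.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(500, 6))
enrolled = (candidates[:, 0] + 0.5 * candidates[:, 3] + rng.normal(size=500) > 0).astype(float)
print(select_attributes(candidates, enrolled,
                        ["FARecieved", "AppStatus", "ApplicantType",
                         "StudBudget", "TotalAward", "AppDate"]))
```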
Categorical type: FARecieved, AppStatus, Applicant Type
Numerical type: StudBudget, TotalAward

Table 4: Optimal set of mining attributes identified

At this stage we have:

- Designed an Analytic Services cube
- Loaded it with relevant data
- Identified the optimal subset of measures (mining attributes) modeling the problem

We will now use the Data Mining Framework to define an appropriate model for the business problem, based on the Analytic Services cube and the identified subset of mining attributes (measures). Setting up the model includes selecting the algorithm, defining algorithm parameters, and identifying the input and output data locations for the algorithm.

choosing the algorithm

The next step in the Data Mining process is to pick the appropriate algorithm. Six basic algorithms are provided in the Data Mining Framework: Naïve Bayes, Regression, Decision Tree, Neural Network, Clustering, and Association Rules. The Analytic Services Data Mining Framework also allows for the inclusion of new algorithms through a well-defined process described in the vendor guide that is part of the Data Mining SDK. The six basic algorithms are a sample set shipped with the product to provide a starting point for using the Data Mining Framework. Choosing an algorithm for a specific problem needs basic knowledge of the problem domain and of the applicability of specific mathematical techniques to efficiently solve problems in that domain. The specific problem discussed in this paper falls into a class of problems termed classification problems. The need here is to classify each applicant into a discrete set of classes on the basis of certain numerical and categorical information available about the applicant. The class referred to in this context is the status of the applicant's application looked at from an enrollment perspective: will enroll or will not enroll. Historical data is available indicating which kinds of applicants (with specific combinations of categorical and numerical factors associated with them) accepted offers from ABC University and subsequently enrolled in its programs. There is data available for the negative case as well, i.e., applicants who did not eventually enroll in the program.

Given that this problem can be looked at as a classification problem and that historical information is available, one of the algorithms suitable for the analysis is the Naïve Bayes classification algorithm. We chose Naïve Bayes for modeling this particular business problem.

deciding on the algorithm parameters

Every algorithm has a set of parameters that control its behavior. Algorithm users need to choose the parameters based on their knowledge of the problem domain and the characteristics of the input data. Analytic Services provides adequate support for such preliminary analysis of data using Hyperion Visual Explorer or the Analytic Services Spreadsheet Client; users are free to analyze the data with any convenient tool to determine their choices for the various algorithm parameters. Each algorithm has a set of parameters that determine how it will process the input data. For the current case, the algorithm chosen is Naïve Bayes and it has four parameters that need to be specified: Categorical, Numerical, RangeCount, and Threshold. The details of each parameter and the implications of setting them are described in the online help documentation. Among the selected attributes we have a few that are of categorical type, and hence our choice for the Categorical parameter is yes. Similarly, there are attributes of numerical type, and hence the choice for the Numerical parameter is also yes. The data was analyzed using a histogram plot to understand its distribution before deciding on the value for the RangeCount parameter. This parameter needs to be large enough to allow the algorithm to use all the variety available in the data, and at the same time small enough to prevent overfitting. From the analysis of the input data for this particular case, setting this parameter to 12 seemed reasonable. The RangeCount controls the binning process in the algorithm (see footnote 1). It should be emphasized that binning schemes (including the bin count) really depend on the specific circumstances and may vary to a great degree between problems.

At this stage we have:

- Designed an Analytic Services cube
- Loaded it with relevant data
- Identified the optimal subset of measures (mining attributes)
- Chosen the algorithm suitable for the problem
- Identified the parameter values for the chosen algorithm
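To make the binning discussion concrete, the sketch below (plain Python on invented values, not the framework's implementation) splits a numerical attribute into 12 equal-width ranges and shows the kind of count-based Naïve Bayes computation those binned values feed into. The small smoothing constant only prevents zero probabilities; no claim is made that it corresponds to the framework's Threshold parameter.

```python
import math
from collections import Counter, defaultdict

def make_bins(values, range_count=12):
    """Equal-width binning: split the observed range into range_count bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / range_count or 1.0
    return lambda v: min(int((v - lo) / width), range_count - 1)

def train_naive_bayes(rows, labels, smoothing=1e-4):
    """Count-based Naive Bayes over already-discretized (binned/categorical) attributes."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)          # (attribute, class) -> value counts
    for row, label in zip(rows, labels):
        for attr, value in row.items():
            value_counts[(attr, label)][value] += 1

    def predict(row):
        scores = {}
        for label, n in class_counts.items():
            score = math.log(n / len(labels))    # class prior
            for attr, value in row.items():
                count = value_counts[(attr, label)][value]
                score += math.log((count + smoothing) / (n + smoothing))
            scores[label] = score
        return max(scores, key=scores.get)

    return predict

# Invented example: bin a numerical attribute, then classify a new applicant.
budgets = [8000, 9500, 12000, 15000, 21000, 30000]
to_bin = make_bins(budgets, range_count=12)
rows = [{"StudBudgetBin": to_bin(b), "AppStatus": s}
        for b, s in zip(budgets, [1, 1, 2, 3, 3, 4])]
labels = ["no", "no", "no", "yes", "yes", "yes"]
predict = train_naive_bayes(rows, labels)
print(predict({"StudBudgetBin": to_bin(22000), "AppStatus": 3}))   # prints "yes"
```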
applying the data mining framework

Now that we have completed all the preparatory steps for Data Mining, the next step is to use the Data Mining Wizard in the Administration Services Console to build a Data Mining model for the business problem. There are three steps involved in effectively using the Data Mining functionality to provide predictive solutions to business problems:

1. Building the Data Mining model
2. Testing the Data Mining model
3. Applying the Data Mining model

Each of these steps, performed using the Data Mining Wizard in the Administration Services Console, uses MDX expressions to define the context within the cube in which to perform the data mining operation. Various accessors, specified as MDX expressions, identify data locations within the cube. The framework uses the data in those locations as input to the algorithm, or writes output to the specified location. Accessors need to be defined for each algorithm to give it the specific context for each of the following:

- the attribute domain: the expression identifying the factors of our analysis that will be used for prediction (in the current context, the mining attributes that we identified)
- the sequence domain: the expression identifying the cases/records that need to be analyzed (in the current context, the list of applicants)
- the external domain: the expression identifying whether multiple models need to be built (not relevant in the current context)
- the anchor: the expression specifying additional restrictions from dimensions that are not really participating in this data mining operation (in the current context all the dimensions of the cube are relevant to the problem, so the anchor only restricts the algorithm scope to the right measure in the Measures dimension)

Additional details for each of these expressions can be obtained from the online help documentation.
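Purely as a conceptual illustration of how these domains partition the work (this is not the framework's API; the function, its arguments and the sample keys are invented), the build step can be thought of as iterating the sequence domain and reading the attribute-domain members for each case, with the anchor pinning the remaining dimensions:

```python
# Conceptual sketch only, not the Data Mining Framework's API.
# 'cube' stands in for a lookup keyed by a tuple of member names.
def gather_training_rows(cube, sequence_members, attribute_members, anchor=()):
    """For every case in the sequence domain, read the values of the
    attribute-domain members, holding the anchor members fixed."""
    rows = []
    for case in sequence_members:                       # e.g. one member per applicant
        row = {attr: cube.get((case, attr) + tuple(anchor))
               for attr in attribute_members}           # the predictors for this case
        rows.append(row)
    return rows

# Tiny invented example.
cube = {("Applicant1", "AppStatus"): 3, ("Applicant1", "StudBudget"): 12000,
        ("Applicant2", "AppStatus"): 1, ("Applicant2", "StudBudget"): 9500}
print(gather_training_rows(cube, ["Applicant1", "Applicant2"],
                           ["AppStatus", "StudBudget"]))
```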

building the data mining model

To access the Data Mining Framework, bring up the Data Mining Wizard in the Administration Services Console and choose the appropriate application and database, as shown in Figure 1.

Figure 1: Choosing the application and database

In the next screen (Figure 2), you choose the appropriate task option, depending on whether you are building a new model or revising an existing one.

Figure 2: Creating a Build Task

This brings up the wizard screen for setting the algorithm parameters and the accessor information associated with the chosen algorithm, in this case Naïve Bayes. The user selects a node in the left pane to see and provide values for the appropriate options and fields displayed in the right pane. As shown in Figure 3, select Choose mining task settings to set how missing data in the cube should be handled. The choice in this case is As NaN, i.e., missing data is replaced with NaN (Not-a-Number).

Figure 3: Settings to handle missing data

The Naïve Bayes algorithm requires that we declare upfront whether we plan to use Categorical predictors, Numerical predictors, or both. In the context of the current case we have both categorical and numerical attribute types, and hence the choice is True for both of these parameters. RangeCount was set to 12. Threshold was fixed at 1e-4, a very small value. Figure 4 shows the completed screen for the parameter settings.

Figure 4: Setting parameters

The Naïve Bayes algorithm has two predictor accessors, Numerical Predictor and Categorical Predictor, and one target accessor. Figure 5 shows the various domains that need to be defined for the accessors, and Table 5 shows the values that were used for the case being discussed. All the information provided during this stage of model building is preserved in a template file to facilitate reuse if necessary.

Figure 5: Accessors associated with the Naïve Bayes algorithm

Table 5: Setting up accessors for the build mode while using the Naïve Bayes algorithm

Once the accessors are defined, the Data Mining Wizard prompts the user to provide names for the template and model that will be generated at this stage. Figure 6 shows the screen in which the model and template names are defined.

Figure 6: Generating the template and model

At this stage we have:

- Built a Data Mining model using the Naïve Bayes algorithm

testing the data mining model

The next step is to test the newly built model to verify that it satisfies the level of statistical significance needed for the model to be put to use. Ideally, a part of the input data (with valid known outcomes, i.e., historical data) is set aside as a test dataset to verify the goodness of the Data Mining model developed using the algorithm. Testing the model on this test dataset and comparing the outcomes predicted by the model against the known outcomes (historical data) is one of the processes supported by the Data Mining Wizard. A test mode template can be created by a process similar to creating a build mode template, as described in the previous section. While building the test mode template the user needs to provide a Confidence parameter to tell the Data Mining Framework the minimum confidence level necessary to declare the model valid. We specified a value of 0.95 for the Confidence parameter. The exact steps in the wizard and descriptions of the various parameters can be obtained from the online help documentation.
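Conceptually, the test compares the model's predictions on held-back historical records against the known outcomes. The sketch below (plain Python, invented data and a stand-in predictor) illustrates that idea outside the framework; treating the Confidence value as a minimum agreement rate is an assumption made only for this illustration, not a statement about how the framework computes its test.

```python
def holdout_accuracy(predict, test_rows, known_outcomes):
    """Fraction of held-back historical records that the model classifies correctly."""
    hits = sum(predict(row) == actual for row, actual in zip(test_rows, known_outcomes))
    return hits / len(test_rows)

# Invented stand-in predictor and a tiny holdout set with known outcomes.
predict = lambda row: "enroll" if row["TotalAward"] > 10000 else "not enroll"
test_rows = [{"TotalAward": 15000}, {"TotalAward": 4000}, {"TotalAward": 12000}]
known = ["enroll", "not enroll", "not enroll"]

required_confidence = 0.95            # the Confidence value used in the case study
accuracy = holdout_accuracy(predict, test_rows, known)
# Assumption for illustration only: treat the required confidence as a minimum agreement rate.
print(accuracy, "valid" if accuracy >= required_confidence else "needs rework")
```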

Once the process is completed, the results of the test (whose name was specified in the last step of the Data Mining Wizard) appear under the Model Results node. Figure 7 shows the Administration Services Console Enterprise View pane in which the Mining Results node is visible. The model can be queried within the Administration Services Console interface to obtain a list of the model accessors by using the Query Result functionality. Invoking Show Result for the Test accessor indicates the result of the test. Figure 8 shows the list of model accessors in the result set of a model based on the Naïve Bayes algorithm used in test mode. If the Test accessor has a value of 1.0, the test is deemed successful and the model is declared good, or valid for prediction. Figure 9 shows the result of the test for the case being discussed in this paper.

At this stage we have:

- Built a Data Mining model using the Naïve Bayes algorithm
- Verified the model as valid with 95% confidence

Figure 7: Model Results node in the Administration Services Console interface

Figure 8: Model accessors for the result set associated with a model based on the Naïve Bayes algorithm

Figure 9: Test results

applying the data mining model

The intent at this stage is to use the newly constructed Data Mining model to predict whether new applicants are likely to enroll in the program. Using the Data Mining model in the apply mode is similar to the earlier two steps. The Data Mining Wizard guides the user to provide the parameters appropriate to the apply mode. The Target domain is usually different in the apply mode, since data is written back to the cube. The details of the various accessors and the associated domains can be obtained from the online help documentation. Table 6 shows the values that were provided to the Data Mining Wizard to use the model in the apply mode.

Table 6: Setting up accessors for the apply mode while using the Naïve Bayes algorithm

Just as in the build mode, the names of the results model and template are specified in the wizard and the template is saved before the model is executed. The results of the prediction are written into the location specified by the Target accessor: the mining attribute referred to by the MDX expression {[ActualStatus]}. The results can be visualized either by querying the model results in the Administration Services Console using the Query Result functionality, as described in the previous section, or by accessing the cube and reviewing the data written back to it. One option for viewing the results is to use the Analytic Services Spreadsheet Client to connect to the database and view the cube data for the ActualStatus measure.

interpreting the results

The results of the Data Mining model need to be interpreted in the context of the business problem it is attempting to solve. Any transformation done to the input measures needs to be appropriately adjusted for when interpreting the results. In the context of the case being discussed in this paper, the intent was to predict whether applicants were likely to enroll at ABC University. The possible outcomes are that the applicant will enroll or the applicant will not enroll. The model was verified against the entire set of available data (over 11,000 records).

the confusion matrix

You can construct a confusion matrix by listing the false positives and false negatives in a tabular format. A false positive happens when the model predicts that an applicant will enroll and in reality the applicant does not enroll. A false negative happens when the model predicts that an applicant will not enroll and in reality the applicant does enroll. The results predicted by the model can be compared with the actual outcomes available in the historical data to build the confusion matrix. In general, for such classification problems, one of these (false positives or false negatives) is likely to be somewhat more important than the other in a business context. In the case being discussed in this paper, a false negative means lost revenue, whereas a false positive means additional promotional expenditure in following up on an applicant who will eventually not enroll. The importance of each should be analyzed in the context of the business, and the model rebuilt if necessary with a different training set (historical data) or with a different set of attributes. Figure 10 shows the confusion matrix constructed using the data set analyzed as part of this case study. It is evident from the confusion matrix that the model predicted that 1550 (1478 + 72) students will enroll. Of those, 1478 actually enrolled and 72 did not, which means there were 72 false positives. Similarly, the model predicted that 9805 (9356 + 449) students will not enroll. Of those, 9356 actually did not enroll, whereas 449 did, which means there were 449 false negatives.

Figure 10: Confusion matrix to analyze the model's effectiveness in prediction

analyzing the results

On further analysis of the results the following observations can be made:

Incorrect Predictions    # of Cases    Percentage of Cases
False positives          72            0.634%
False negatives          449           3.954%
Total                    521           4.59%

Success rate of the model: 95.41% (only 521 incorrect predictions in 11,355 cases)
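The arithmetic above is easy to reproduce. The short sketch below (plain Python) recomputes the false-positive, false-negative and success-rate figures directly from the four cell counts reported in Figure 10.

```python
# Cell counts from Figure 10: rows are the model's predictions, columns are actual outcomes.
predicted_enroll = {"enrolled": 1478, "not_enrolled": 72}       # 72 false positives
predicted_not_enroll = {"enrolled": 449, "not_enrolled": 9356}  # 449 false negatives

false_positives = predicted_enroll["not_enrolled"]
false_negatives = predicted_not_enroll["enrolled"]
total_cases = sum(predicted_enroll.values()) + sum(predicted_not_enroll.values())
incorrect = false_positives + false_negatives

print(f"false positives: {false_positives} ({false_positives / total_cases:.3%})")
print(f"false negatives: {false_negatives} ({false_negatives / total_cases:.3%})")
print(f"success rate: {(total_cases - incorrect) / total_cases:.2%} "
      f"({incorrect} incorrect predictions in {total_cases} cases)")
```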

additional functionality

The Analytic Services Data Mining Framework offers more functionality that can be used when deploying models in real business scenarios. Some of the further steps that can be considered include the following.

transformations

The Data Mining Framework offers the ability to apply a transform to the input data just before it is presented to the algorithm. Similarly, the output data can be transformed before being written into the Analytic Services cube. The Data Mining Framework offers a basic list of transformations (exp, log, pow, scale, shift, linear) that can be used through the Data Mining Wizard. The details of each of these transformations, what they do, and how to use them can be obtained from the Analytic Services online help documentation. The list of transformations is further extensible through the import of custom Java routines written specifically for the purpose. The details of how to write Java routines to be imported as additional transforms can be obtained from the vendor guide shipped as part of the Data Mining SDK.

mapping

When a model has been developed for one context and needs to be used in another, the Mapping functionality is useful. Through this functionality the user can tell the Data Mining Framework how to interpret the existing model accessors in the new context in which the model is being deployed. More information on using this functionality can be obtained from the online help documentation.

import/export of pmml models

The Data Mining Framework allows for portability through import and export of mining models using the PMML format.

setting up models for scoring

Data Mining models built using the Analytic Services Data Mining Framework can also be set up for scoring. In scoring mode the user interacts with the model in real time and the results are not written to the database. The input data can be sourced either from the cube or through data templates that the user fills in during execution. The scoring mode of deployment can be combined with custom applications built using the developer tools provided by Hyperion Application Builder to create applications that cater to a specific business process while leveraging the predictive analytic capability of the Analytic Services Data Mining Framework.
The online help documentation provides additional details on how to score a Data Mining model.

using the data mining framework in batch mode

There is also a batch mode interface to the functionality provided in the Data Mining Framework. Scripts written using the MaxL command interface can perform almost all of the functions exposed through the Data Mining Wizard. Details of the MaxL commands and their usage can be obtained from the online help documentation.

building custom applications

Custom applications can be developed using Analytic Services as the backend database and the developer tools provided with Hyperion Application Builder. The functionality provided by the Data Mining Framework can be invoked through APIs.

summary

Data Mining is one of the functional groups among the comprehensive, enterprise-class analytic functionality offered within Analytic Services. This case study focused on using the Naïve Bayes algorithm to solve a classification problem modeled on a real-life data set. It was possible to achieve a 95.41% success rate in the classification exercise using the Analytic Services Data Mining Framework. Some of the business benefits of Data Mining in the OLAP context that can be illustrated from the current case include:

- It can serve as a discovery tool in a critical decision-support process. This includes evaluating the critical parameters affecting customer (applicant) behavior. ABC University had initially assumed that certain time-related factors played the strongest role in influencing the decision to enroll. The Data Mining exercise showed this was not the case; in fact, certain financial attributes emerged as the strongest factors.

- The successful prediction mechanism can become the base for a full-blown risk-management application. In the case of ABC University, for example, the admissions office can devise a policy to invest more promotional expenditure in tracking applicants with distinctly higher academic credentials but a moderate probability of enrollment. Similarly, the prediction mechanism can help the admissions department make decisions on admission offers even before it has seen the entire applicant pool.

- It can serve as an operational control and reporting tool. Traditional OLAP reporting can provide visibility into the state of the admissions operations, the extent of funds utilization, and various other financial and operational indicators, in all providing better control over the conformance between planned and actual business positions.

suggested reading

1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
2. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Michael J. A. Berry and Gordon S. Linoff
3. Data Mining Explained, Rhonda Delmater and Monte Hancock, Jr.
4. Data Mining: A Hands-On Approach for Business Professionals (Data Warehousing Institute Series), Robert Groth

footnote

1. Binning: breaking up a continuous range of data into discrete segments (bins).

Copyright 2005 Hyperion Solutions Corporation. All rights reserved. Hyperion, the Hyperion H logo, and Hyperion's product names are trademarks of Hyperion. References to other companies and their products use trademarks owned by the respective companies and are for reference purposes only.


More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Grow Revenues and Reduce Risk with Powerful Analytics Software

Grow Revenues and Reduce Risk with Powerful Analytics Software Grow Revenues and Reduce Risk with Powerful Analytics Software Overview Gaining knowledge through data selection, data exploration, model creation and predictive action is the key to increasing revenues,

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Data Mining for Everyone

Data Mining for Everyone Page 1 Data Mining for Everyone Christoph Sieb Senior Software Engineer, Data Mining Development Dr. Andreas Zekl Manager, Data Mining Development Page 2 Executive Summary Contents 2 Data mining in the

More information

Using Data Mining to Detect Insurance Fraud

Using Data Mining to Detect Insurance Fraud IBM SPSS Modeler Using Data Mining to Detect Insurance Fraud Improve accuracy and minimize loss Highlights: Combine powerful analytical techniques with existing fraud detection and prevention efforts Build

More information

GETTING AHEAD OF THE COMPETITION WITH DATA MINING

GETTING AHEAD OF THE COMPETITION WITH DATA MINING WHITE PAPER GETTING AHEAD OF THE COMPETITION WITH DATA MINING Ultimately, data mining boils down to continually finding new ways to be more profitable which in today s competitive world means making better

More information

College Readiness LINKING STUDY

College Readiness LINKING STUDY College Readiness LINKING STUDY A Study of the Alignment of the RIT Scales of NWEA s MAP Assessments with the College Readiness Benchmarks of EXPLORE, PLAN, and ACT December 2011 (updated January 17, 2012)

More information

Salesforce Certified Data Architecture and Management Designer. Study Guide. Summer 16 TRAINING & CERTIFICATION

Salesforce Certified Data Architecture and Management Designer. Study Guide. Summer 16 TRAINING & CERTIFICATION Salesforce Certified Data Architecture and Management Designer Study Guide Summer 16 Contents SECTION 1. PURPOSE OF THIS STUDY GUIDE... 2 SECTION 2. ABOUT THE SALESFORCE CERTIFIED DATA ARCHITECTURE AND

More information

Technology WHITE PAPER

Technology WHITE PAPER Technology WHITE PAPER What We Do Neota Logic builds software with which the knowledge of experts can be delivered in an operationally useful form as applications embedded in business systems or consulted

More information

Fluency With Information Technology CSE100/IMT100

Fluency With Information Technology CSE100/IMT100 Fluency With Information Technology CSE100/IMT100 ),7 Larry Snyder & Mel Oyler, Instructors Ariel Kemp, Isaac Kunen, Gerome Miklau & Sean Squires, Teaching Assistants University of Washington, Autumn 1999

More information

BIG DATA COURSE 1 DATA QUALITY STRATEGIES - CUSTOMIZED TRAINING OUTLINE. Prepared by:

BIG DATA COURSE 1 DATA QUALITY STRATEGIES - CUSTOMIZED TRAINING OUTLINE. Prepared by: BIG DATA COURSE 1 DATA QUALITY STRATEGIES - CUSTOMIZED TRAINING OUTLINE Cerulium Corporation has provided quality education and consulting expertise for over six years. We offer customized solutions to

More information

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing

More information

CROSS INDUSTRY PegaRULES Process Commander. Bringing Insight and Streamlining Change with the PegaRULES Process Simulator

CROSS INDUSTRY PegaRULES Process Commander. Bringing Insight and Streamlining Change with the PegaRULES Process Simulator CROSS INDUSTRY PegaRULES Process Commander Bringing Insight and Streamlining Change with the PegaRULES Process Simulator Executive Summary All enterprises aim to increase revenues and drive down costs.

More information

Business Intelligence & Product Analytics

Business Intelligence & Product Analytics 2010 International Conference Business Intelligence & Product Analytics Rob McAveney www. 300 Brickstone Square Suite 904 Andover, MA 01810 [978] 691 8900 www. Copyright 2010 Aras All Rights Reserved.

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

Customer Analytics. Turn Big Data into Big Value

Customer Analytics. Turn Big Data into Big Value Turn Big Data into Big Value All Your Data Integrated in Just One Place BIRT Analytics lets you capture the value of Big Data that speeds right by most enterprises. It analyzes massive volumes of data

More information

Java Metadata Interface and Data Warehousing

Java Metadata Interface and Data Warehousing Java Metadata Interface and Data Warehousing A JMI white paper by John D. Poole November 2002 Abstract. This paper describes a model-driven approach to data warehouse administration by presenting a detailed

More information

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved.

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved. Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

More information

DATA WAREHOUSING AND OLAP TECHNOLOGY

DATA WAREHOUSING AND OLAP TECHNOLOGY DATA WAREHOUSING AND OLAP TECHNOLOGY Manya Sethi MCA Final Year Amity University, Uttar Pradesh Under Guidance of Ms. Shruti Nagpal Abstract DATA WAREHOUSING and Online Analytical Processing (OLAP) are

More information

Integrating SAP and non-sap data for comprehensive Business Intelligence

Integrating SAP and non-sap data for comprehensive Business Intelligence WHITE PAPER Integrating SAP and non-sap data for comprehensive Business Intelligence www.barc.de/en Business Application Research Center 2 Integrating SAP and non-sap data Authors Timm Grosser Senior Analyst

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Oracle Data Miner (Extension of SQL Developer 4.0)

Oracle Data Miner (Extension of SQL Developer 4.0) An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Chapter 5. Warehousing, Data Acquisition, Data. Visualization Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization 5-1 Learning Objectives

More information

Visual Data Mining in Indian Election System

Visual Data Mining in Indian Election System Visual Data Mining in Indian Election System Prof. T. M. Kodinariya Asst. Professor, Department of Computer Engineering, Atmiya Institute of Technology & Science, Rajkot Gujarat, India trupti.kodinariya@gmail.com

More information