Database Marketing, Business Intelligence and Knowledge Discovery
Note: Uses material from Tan / Steinbach / Kumar (2005), Introduction to Data Mining, Addison Wesley; and Cios / Pedrycz / Swiniarski / Kurgan (2007), Data Mining: A Knowledge Discovery Approach, Springer.
Database Marketing
Database marketing is a form of direct marketing that uses databases of customers or potential customers to generate personalized communications promoting a product or service.
The distinction between direct and database marketing stems primarily from the attention paid to the analysis of data. Database marketing emphasizes the use of statistical techniques to develop models of customer behavior, which are then used to select customers for communications.
Database Marketing
Classic database marketing
- Customer list (in-house or bought)
- Simple model based on past data
- E-mails, coupons, offers
Database marketing 2.0
- Integrated data sources (internal, external) and warehouses
- Complex models (data mining, social network analysis)
- Communication channels include social media, direct web interactions (recommender systems), and many more
Business Intelligence
Business intelligence (BI) encompasses architectures, tools, applications, databases, and methodologies for the collection, integration, analysis, and presentation of business information.
The purpose of business intelligence is to support better business decision making.
BI Components and Architecture [figure]
Transactional vs. Analytical Data Processing
Transactional processing takes place in operational systems that give the organization the capability to perform business transactions and produce transaction reports. Its primary aim is fast and efficient processing of routine, repetitive data.
Analytical processing is a supplementary activity to transaction processing that involves the analysis of accumulated data. Analytical processing, sometimes referred to as business intelligence, includes data mining, decision support systems (DSS), querying, and other analysis activities.
These analyses place strategic information in the hands of decision makers, helping them enhance productivity and make better decisions, which leads to greater competitive advantage.
Business Analytics
Business analytics is how organizations gather and interpret data in order to make better business decisions and to optimize business processes.
In businesses, analytics (alongside data access and reporting) represents a subset of business intelligence (BI).
Analytics is defined as the extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling, and fact-based decision-making.
Analytics may be used as input for human decisions, but there are also examples of fully automated decisions that require minimal human intervention.
Business Analytics [figure]
Knowledge Discovery
The process of automatically searching large volumes of data for patterns that can be considered knowledge about the data.
Evolutionary stages (business question; enabling technologies; characteristics):
- Data collection (1980s)
  Business question: What was my total revenue in the last 5 years?
  Enabling technologies: Computers, tapes, disks
  Characteristics: Retrospective, static data delivery
- Data access (1980s)
  Business question: What were unit sales in New England last March?
  Enabling technologies: Relational databases (RDBMS), structured query language (SQL)
  Characteristics: Retrospective, dynamic data delivery at record level
- Data warehousing and decision support (early 1990s)
  Business question: What were the sales in region A by product, by salesperson?
  Enabling technologies: OLAP, multidimensional databases, data warehouses
  Characteristics: Retrospective, proactive data delivery at multiple levels
- Intelligent data mining (late 1990s)
  Business question: What's likely to happen to the Boston unit's sales next month? Why?
  Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
  Characteristics: Prospective, proactive information delivery
- Advanced intelligent systems; complete integration (2000-2004)
  Business question: What is the best plan to follow? How did we perform compared to metrics?
  Enabling technologies: Neural computing, advanced AI models, complex optimization, web services
  Characteristics: Proactive, integrative; multiple business partners
Data Mining
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns
- Prediction methods: use some variables to predict unknown or future values of other variables
- Description methods: find human-interpretable patterns that describe the data
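To make the two families concrete, here is a minimal sketch using only the Python standard library. The choice of techniques (a least-squares line as the prediction method, co-occurring item pairs as the description method), the function names, and the toy data are all assumptions made for illustration, not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def fit_line(xs, ys):
    """Prediction method: least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def frequent_pairs(baskets, min_count):
    """Description method: item pairs that co-occur in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Hypothetical toy data: advertising spend vs. sales.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(ad_spend, sales)
predicted = intercept + slope * 5.0  # predict sales for an unseen spend level

# Hypothetical toy data: shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
pairs = frequent_pairs(baskets, min_count=2)
```

The first function uses known variables to estimate an unknown value; the second only summarizes the data in a form a person can read, which is the distinction the slide draws.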
Text Mining
The application of data mining to non-structured or less-structured text files.
Text mining helps organizations to do the following:
(1) find the hidden content of documents, including additional useful relationships, and
(2) group documents by common themes (e.g., identify all the customers of an insurance firm who have similar complaints).
Web Mining
The application of data mining techniques to discover actionable and meaningful patterns, profiles, and trends from web resources.
Web mining is used in the following areas: information filtering, mining of web-access logs for analyzing usage, assisted browsing, ...
Data Life Cycle Process [figure]
Knowledge Discovery Process
The knowledge discovery process (KDP) forms the overall process for extracting new knowledge from data.
- It is a sequence of steps (with feedback loops) that should be followed to discover new knowledge (e.g., patterns)
- A well-defined KDP model is a logical, cohesive, well-thought-out structure and approach that can be presented to decision-makers who may have difficulty understanding the need, value, and mechanics behind a KDP
- It helps ensure that the end product is useful for the user/owner of the data
- KD projects require a significant project management effort that needs to be grounded in a solid framework
- KD should follow other disciplines that have established models
Knowledge Discovery Process
KDP is defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. It:
- consists of many steps (one of which is Data Mining), each attempting to complete a particular discovery task and accomplished by the application of a DM method
- concerns the entire KD process, including how the data is stored and accessed, how to use efficient and scalable algorithms to analyze large datasets, how to interpret and visualize the results, and how to model and support interaction between human and machine
- concerns support for learning and analyzing the application domain
Overview of the Knowledge Discovery Process
- It consists of multiple steps, which are executed in sequence
- The next step is initiated upon successful completion of the previous step, and requires the result generated by the previous step as its input
- It stretches from the task of understanding the project domain and data, through data preparation and analysis, to evaluation, understanding, and application of the generated results
- It is iterative, i.e., it includes feedback loops that are triggered by revisions
Input data (database, images, video, semi-structured data, etc.) -> STEP 1 -> STEP 2 -> ... -> STEP n-1 -> STEP n -> Knowledge (patterns, rules, clusters, classification, associations, etc.)
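The control flow described above (steps run in order, each consuming the previous result, with feedback loops triggered by revisions) can be sketched as a small runner. Everything here is hypothetical: the `run_kdp` function, the convention that a step returns `(result, back_to)`, the shared `ctx` dictionary, and the two toy steps are assumptions chosen only to illustrate the idea.

```python
def run_kdp(steps, data, ctx, max_revisions=3):
    """Run steps in sequence; a step may send the process back to an earlier step.

    Each step is called as step(previous_result, ctx) and returns
    (result, back_to): back_to is None on success, or the index of an
    earlier step to revisit (a feedback loop).
    """
    inputs = [data]  # inputs[i] is the input handed to step i
    trace = []       # names of the steps in the order they were executed
    revisions = 0
    i = 0
    while i < len(steps):
        trace.append(steps[i].__name__)
        result, back_to = steps[i](inputs[i], ctx)
        if back_to is not None:
            if revisions >= max_revisions:
                raise RuntimeError("too many revisions")
            revisions += 1
            del inputs[back_to + 1:]  # discard results produced after that step
            i = back_to
        else:
            inputs.append(result)     # becomes the input of the next step
            i += 1
    return inputs[-1], trace


def prepare(values, ctx):
    """Toy preparation step: drop missing values, optionally filter strictly."""
    cleaned = [v for v in values if v is not None]
    if ctx["strict"]:
        cleaned = [v for v in cleaned if v <= 10]
    return cleaned, None


def mine(values, ctx):
    """Toy mining step: needs at least 3 records, otherwise asks for a revision."""
    if len(values) < 3:
        ctx["strict"] = False  # the revision that triggers the feedback loop
        return None, 0         # go back to step 0 (prepare)
    return sum(values) / len(values), None


data = [4, None, 12, 20, 8]
ctx = {"strict": True}
knowledge, trace = run_kdp([prepare, mine], data, ctx)
```

In this run the strict preparation leaves too few records, so the mining step relaxes the setting and sends the process back to preparation before succeeding on the second pass.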
Knowledge Discovery Process Models
Popular KDP models include:
- Nine-step model by Fayyad and colleagues (academic)
- CRISP-DM (CRoss-Industry Standard Process for Data Mining) model (industrial)
- Six-step KDP model by Cios and colleagues (hybrid academic/industrial)
Knowledge Discovery Process Models
Nine-step model by Fayyad and colleagues
1. Developing and Understanding of the Application Domain
   It includes learning the relevant prior knowledge and the goals of the end-user of the discovered knowledge.
2. Creating a Target Data Set
   It selects a subset of variables (attributes) and data points (examples), which will be used to perform discovery tasks. It usually includes querying the existing data to select the desired subset.
3. Data Cleaning and Preprocessing
   It consists of removing outliers, dealing with noise and missing values in the data, and accounting for time sequence information and known changes.
4. Data Reduction and Projection
   It consists of finding useful attributes by applying dimension reduction and transformation methods, and finding an invariant representation of the data.
Knowledge Discovery Process Models
Nine-step model by Fayyad and colleagues (continued)
5. Choosing the Data Mining Task
   It matches the goals defined in step 1 with a particular DM method, such as classification, regression, clustering, etc.
6. Choosing the Data Mining Algorithm
   It selects methods for searching for patterns in the data, and decides which models and parameters of the methods used may be appropriate.
7. Data Mining
   It generates patterns in a particular representational form, such as classification rules, decision trees, regression models, trends, etc.
8. Interpreting Mined Patterns
   It usually involves visualization of the extracted patterns and models, and visualization of the data based on the extracted models.
9. Consolidating Discovered Knowledge
   It consists of incorporating the discovered knowledge into the performance system, and documenting and reporting it to the interested parties. It may also include checking and resolving potential conflicts with previously believed knowledge.
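As an illustration of step 3 (Data Cleaning and Preprocessing), the following sketch removes outliers and fills in missing values for a single numeric attribute. The specific rules are assumptions picked for brevity, not something the nine-step model prescribes: outliers are values further than `k` median absolute deviations from the median, and missing values (`None`) are replaced by the median of the remaining values.

```python
from statistics import median


def clean(values, k=5.0):
    """Drop outliers, then impute missing values (None) with the median.

    Assumed rule: a value is an outlier if it lies more than k median
    absolute deviations (MAD) from the median of the observed values.
    """
    present = [v for v in values if v is not None]
    center = median(present)
    mad = median([abs(v - center) for v in present])

    def is_outlier(v):
        # With mad == 0 most values are identical, so nothing is flagged.
        return mad > 0 and abs(v - center) > k * mad

    kept = [v for v in values if v is None or not is_outlier(v)]
    fill = median([v for v in kept if v is not None])
    return [fill if v is None else v for v in kept]


# Hypothetical attribute with one missing value and one extreme value.
raw = [10, 12, None, 11, 13, 95, 12]
cleaned = clean(raw)
```

Other choices (dropping incomplete records, using the mean, or flagging outliers with a different rule) fit the same step; which one is appropriate depends on the data and the DM method chosen later.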
Knowledge Discovery Process Models
CRISP-DM (CRoss-Industry Standard Process for Data Mining) model
- Designed in the late 1990s by four companies: Integral Solutions Ltd. (provider of commercial Data Mining solutions), NCR (database provider), DaimlerChrysler (automobile manufacturer), and OHRA (insurance company)
- The CRISP-DM Special Interest Group was created to support the developed process model; it includes over 300 users and tool/service providers
- The model consists of six steps
Knowledge Discovery Process Models
CRISP-DM model
1. Business Understanding
   It focuses on understanding objectives and requirements from a business perspective. It also converts them into a DM problem definition, and designs a preliminary project plan to achieve the objectives. It is further broken into several sub-steps:
   - determination of business objectives
   - assessment of situation
   - determination of DM goals
   - generation of project plan
2. Data Understanding
   It starts with an initial data collection and familiarization with the data. Specific aims include identification of data quality problems, discovery of initial insights into the data, and detection of interesting data subsets. It is further broken down into:
   - collection of initial data
   - description of data
   - exploration of data
   - verification of data quality
Knowledge Discovery Process Models
CRISP-DM model
3. Data Preparation
   It covers all activities needed to construct the final dataset, which constitutes the data that will be fed into DM tool(s) in the next step. It includes table, record, and attribute selection, data cleaning, construction of new attributes, and data transformation. This step is divided into the following sub-steps:
   - selection of data
   - cleansing of data
   - construction of data
   - integration of data
   - formatting of data
Knowledge Discovery Process Models
CRISP-DM model
4. Modeling
   It selects and applies various modeling techniques. It usually involves the use of several methods for the same DM problem type, and calibration of their parameters to optimal values. Since some methods may require a specific format for input data, reiteration into the previous step is often necessary. This step is subdivided into:
   - selection of modeling technique(s)
   - generation of test design
   - creation of models
   - assessment of generated models
Knowledge Discovery Process Models
CRISP-DM model
5. Evaluation
   After one or more models of high quality from a data analysis perspective have been built, the model is evaluated from a business objective perspective. The model is thoroughly evaluated, and the steps executed to construct it are reviewed. A key objective is to determine whether there are important business issues that have not been sufficiently considered. At the end of this phase, a decision on the use of the DM results should be reached. The key sub-steps include:
   - evaluation of the results
   - process review
   - determination of the next step
Knowledge Discovery Process Models
CRISP-DM model
6. Deployment
   It involves organization and presentation of the discovered knowledge in a way that the customer can use. Depending on the requirements, this can be as simple as generating a report or as complex as implementing a repeatable KDP. This step is further divided into the following sub-steps:
   - planning of the deployment
   - planning of the monitoring and maintenance
   - generation of final report
   - review of the process
Knowledge Discovery Process Models
CRISP-DM model
- Is characterized by an easy-to-understand vocabulary and good documentation
- Acknowledges the strongly iterative nature of the process, with loops between several of the steps
- Is a successful and extensively applied model, mainly because of its grounding in practical, industrial, real-world Knowledge Discovery experience
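The Modeling step of CRISP-DM mentions a test design and the calibration of method parameters. A minimal sketch of that idea, assuming a holdout set as the test design and a one-dimensional k-nearest-neighbour classifier as the method being calibrated (both choices, the function names, and the toy data are illustrative assumptions, not part of CRISP-DM itself):

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, holdout, k):
    """Share of holdout points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in holdout)
    return hits / len(holdout)


def calibrate(train, holdout, candidates):
    """Pick the candidate k with the best holdout accuracy (first one on ties)."""
    scores = {k: accuracy(train, holdout, k) for k in candidates}
    best_k = max(candidates, key=lambda k: scores[k])
    return best_k, scores


# Hypothetical data: (attribute value, class label); the point at 4 is noise.
train = [(1, "lo"), (2, "lo"), (3, "lo"), (4, "hi"),
         (5, "lo"), (7, "hi"), (8, "hi"), (9, "hi")]
holdout = [(3.9, "lo"), (4.2, "lo"), (7.5, "hi"), (8.5, "hi")]

best_k, scores = calibrate(train, holdout, candidates=[1, 3, 5])
```

With k = 1 the noisy training point misleads the two nearby holdout points, while larger k values smooth it out, which is why the parameter is chosen on data held back from model creation.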
Knowledge Discovery Process Models
Six-step model by Cios and colleagues
- Developed based on the CRISP-DM model by adapting it to academic research; main differences and extensions include:
  - providing a more general, research-oriented description of the steps
  - introducing the Data Mining step instead of the Modeling step
  - introducing several new explicit feedback mechanisms (the CRISP-DM model has only three major feedback sources, while this model has more detailed feedback mechanisms)
  - modification of the last step: the knowledge discovered for a particular domain may be applied in other domains
- Includes six steps
Knowledge Discovery Process Models
Six-step model [diagram]
1. Understanding of the Problem Domain
2. Understanding of the Data (input data: database, images, video, semi-structured data, etc.)
3. Preparation of the Data
4. Data Mining
5. Evaluation of the Discovered Knowledge (knowledge: patterns, rules, clusters, classification, associations, etc.)
6. Use of the Discovered Knowledge (extend knowledge to other domains)
Knowledge Discovery Process Models
Six-step model by Cios and colleagues
1. Understanding of the Problem Domain
   It involves working closely with domain experts to define the problem and determine the project goals, identifying key people, and learning about current solutions to the problem. It also involves learning domain-specific terminology. A description of the problem, including its restrictions, is prepared. Finally, project goals are translated into DM goals, and an initial selection of the DM tools to be used later in the process is performed.
2. Understanding of the Data
   It includes collection of sample data and deciding which data, including its format and size, will be needed. Background knowledge can be used to guide these efforts. Data is checked for completeness, redundancy, missing values, plausibility of attribute values, etc. Finally, the step includes verification of the usefulness of the data with respect to the DM goals.
Knowledge Discovery Process Models
Six-step model by Cios and colleagues (continued)
3. Preparation of the Data
   It concerns deciding which data will be used as input for DM methods in the next step. It involves sampling, running correlation and significance tests, and data cleaning, which includes checking the completeness of data records, removing or correcting for noise and missing values, etc. The cleaned data may be further processed by feature selection and extraction algorithms (to reduce dimensionality), by derivation of new attributes (say, by discretization), and by summarization of data (data granularization). The end results are data that meet the specific input requirements of the DM tools selected in step 1.
4. Data Mining
   It involves using various DM methods to derive knowledge from preprocessed data.
Knowledge Discovery Process Models
Six-step model by Cios and colleagues (continued)
5. Evaluation of the Discovered Knowledge
   It includes understanding the results, checking whether the discovered knowledge is novel and interesting, interpretation of the results by domain experts, and checking the impact of the discovered knowledge. Only the approved models are retained, and the entire process is revisited to identify which alternative actions could have been taken to improve the results. A list of errors made in the process is prepared.
6. Use of the Discovered Knowledge
   It consists of planning where and how the discovered knowledge will be used. The application area in the current domain may be extended to other domains. A plan to monitor the implementation of the discovered knowledge is created, and the entire project is documented. Finally, the discovered knowledge is deployed.
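The Preparation of the Data step mentions deriving new attributes by discretization. One simple variant, equal-width binning, is sketched below; the choice of this variant, the function name, and the sample values are assumptions for illustration, and other schemes (equal-frequency, supervised) would serve the same purpose.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so they all fall into the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous attribute (customer age) turned into 3 categories.
ages = [18, 25, 33, 41, 58, 78]
age_bins = equal_width_bins(ages, n_bins=3)
```

The derived attribute `age_bins` can then be fed to DM tools that require categorical input, which is the kind of input requirement this step is meant to satisfy.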
Knowledge Discovery Process Models
Six-step model by Cios and colleagues
This model identifies and describes explicit feedback loops:
- from the Understanding of the Data to the Understanding of the Problem Domain step; the loop is caused by the need for additional domain knowledge to better understand the data
- from the Preparation of the Data to the Understanding of the Data step; the loop is caused by the need for additional or more specific information about the data to guide the choice of data preprocessing algorithms
- from the Data Mining to the Understanding of the Problem Domain step; the reason could be unsatisfactory results generated by the selected DM methods, requiring modification of the project's goals
- from the Data Mining to the Understanding of the Data step; the most common reason is poor understanding of the data, which results in incorrect selection of a DM method and its subsequent failure
Knowledge Discovery Process Models
Six-step model by Cios and colleagues: explicit feedback loops (continued)
- from the Data Mining to the Preparation of the Data step; the loop is caused by the need to improve data preparation. This is often caused by the specific requirements of the DM method used, which may not have been known during the Preparation of the Data step
- from the Evaluation of the Discovered Knowledge to the Understanding of the Problem Domain step; the most common cause is invalidity of the discovered knowledge. Possible reasons include incorrect understanding or interpretation of the domain, and incorrect design or understanding of problem restrictions, requirements, or goals
- from the Evaluation of the Discovered Knowledge to the Data Mining step; this loop is executed when the discovered knowledge is not novel, interesting, or useful. The least expensive solution is to choose a different DM tool and repeat the DM step
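The seven feedback loops listed for the six-step model can be written down as a small transition table. The step names and loops come from the slides; the data structure and the `allowed_next` helper are a hypothetical encoding, shown only to make the structure easy to inspect.

```python
# The six steps, in their forward order.
STEPS = [
    "Understanding of the Problem Domain",
    "Understanding of the Data",
    "Preparation of the Data",
    "Data Mining",
    "Evaluation of the Discovered Knowledge",
    "Use of the Discovered Knowledge",
]

# Explicit feedback loops: step -> earlier steps it may return to.
FEEDBACK = {
    "Understanding of the Data": ["Understanding of the Problem Domain"],
    "Preparation of the Data": ["Understanding of the Data"],
    "Data Mining": [
        "Understanding of the Problem Domain",
        "Understanding of the Data",
        "Preparation of the Data",
    ],
    "Evaluation of the Discovered Knowledge": [
        "Understanding of the Problem Domain",
        "Data Mining",
    ],
}


def allowed_next(step):
    """Steps reachable from `step`: the next step forward plus any feedback targets."""
    i = STEPS.index(step)
    forward = STEPS[i + 1:i + 2]  # empty list for the last step
    return forward + FEEDBACK.get(step, [])


loop_count = sum(len(targets) for targets in FEEDBACK.values())
from_data_mining = allowed_next("Data Mining")
```

Data Mining is the step with the most ways back, which matches the slides' emphasis on DM results exposing problems in the earlier steps.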
Comparison of Knowledge Discovery Process Models
Fayyad et al.
- Domain of origin: academic
- Number of steps: 9
- Steps: 1. Developing and Understanding of the Application Domain; 2. Creating a Target Data Set; 3. Data Cleaning and Preprocessing; 4. Data Reduction and Projection; 5. Choosing the Data Mining Task; 6. Choosing the Data Mining Algorithm; 7. Data Mining; 8. Interpreting Mined Patterns; 9. Consolidating Discovered Knowledge
- Notes: the most popular model; provides a detailed technical description with respect to data analysis, but lacks business aspects
- Supporting software: commercial system MineSet™
- Reported application domains: medicine, engineering, production, e-business, software
Cios et al.
- Domain of origin: hybrid (academic/industry)
- Number of steps: 6
- Steps: 1. Understanding of the Problem Domain; 2. Understanding of the Data; 3. Preparation of the Data; 4. Data Mining; 5. Evaluation of the Discovered Knowledge; 6. Use of the Discovered Knowledge
- Notes: draws from both academic and industrial models; emphasizes iterative aspects; identifies and describes explicit feedback loops
- Supporting software: N/A
- Reported application domains: medicine, software
CRISP-DM
- Domain of origin: industry
- Number of steps: 6
- Steps: 1. Business Understanding; 2. Data Understanding; 3. Data Preparation; 4. Modeling; 5. Evaluation; 6. Deployment
- Notes: uses easy-to-understand vocabulary; has good documentation
- Supporting software: commercial system Clementine
- Reported application domains: medicine, engineering, marketing, sales
Comparison of the Knowledge Discovery Process Models
A very important aspect of the KDP is the relative time spent to complete each of the steps:
- it enables precise scheduling
- estimates proposed by both researchers and practitioners are shown in the figure
- specific estimated values depend on many factors, such as existing knowledge about the considered project domain, skill level of human resources, complexity of the problem, etc.
- the data preparation step is by far the most time-consuming step
[Figure: relative effort [%] per KDDM step (Understanding of Domain, Understanding of Data, Preparation of Data, Data Mining, Evaluation of Results, Deployment of Results), comparing the Cabena et al., Shearer, and Cios and Kurgan estimates]