HP NonStop SQL/MX Data Mining Guide


Abstract

This manual presents a nine-step knowledge-discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Product Version: NonStop SQL/MX Release 2.0

Supported Release Version Updates (RVUs): This publication supports G06.23 and all subsequent G-series releases until otherwise indicated by its replacement publication.

Part Number:
Published: April 2004

Document History

Part Number    Product Version               Published
               NonStop SQL/MX Release 1.0    February
               NonStop SQL/MX Release 2.0    April 2004

Contents

What's New in This Manual iii
    Manual Information iii
    New and Changed Information iii

About This Manual v
    Audience v
    Organization v
    Related Documentation vi
    Notation Conventions viii

1. Introduction
    The Traditional Approach 1-1
    The SQL/MX Approach 1-2
        Data-Intensive Computations Performed in the DBMS 1-2
        Use of Built-In DBMS Data Structures and Operations 1-2
    The Knowledge Discovery Process 1-3
        Defining the Business Opportunity 1-4
        Preparing the Data 1-7
        Creating the Mining View 1-10
        Mining the Data 1-10
        Knowledge Deployment and Monitoring

2. Preparing the Data
    Loading the Data 2-2
        Creating the Database 2-2
        Importing Data Into the Database 2-2
    Profiling the Data 2-2
        Cardinalities and Metrics 2-3
        Transposition 2-3
        Quick Profiling 2-5
    Defining Events 2-6
    Aligning the Data 2-6
    Deriving Attributes 2-9
        Moving Metrics 2-9
        Rankings

3. Creating the Data Mining View
    Creating the Single Table 3-2
    Pivoting the Data

4. Mining the Data
    Building the Model 4-2
        Building Decision Trees 4-2
    Checking the Model 4-9
    Applying the Model to the Mining Table 4-10
    Applying the Model to the Database 4-10
    Deploying the Model 4-10
    Monitoring Model Performance 4-11

A. Creating the Data Mining Database

B. Inserting Into the Data Mining Database

C. Importing Into the Data Mining Database
    Importing Customers Data C-1
        Customers Format File C-1
        Customers Data File C-1
    Importing Account History Data C-2
        Account History Format File C-3
        Account History Data File C-3

Index

Figures
    Figure 4-1. Initial Branches of Decision Tree 4-4
    Figure 4-2. Decision Tree for Divorced Branch 4-5
    Figure 4-3. Decision Tree for Single Branch 4-6
    Figure 4-4. Final Decision Tree 4-9

Tables
    Table i. Manual Organization v

Hewlett-Packard Company

What's New in This Manual

Manual Information

Abstract

HP NonStop SQL/MX Data Mining Guide

This manual presents a nine-step knowledge-discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Product Version: NonStop SQL/MX Release 2.0

Supported Release Version Updates (RVUs): This publication supports G06.23 and all subsequent G-series releases until otherwise indicated by its replacement publication.

Part Number:
Published: April 2004

Document History

Part Number    Product Version               Published
               NonStop SQL/MX Release 1.0    February
               NonStop SQL/MX Release 2.0    April 2004

New and Changed Information

This publication has been updated to reflect new product names. Because product names are changing over time, this publication might contain both HP and Compaq product names. Product names in graphic representations are consistent with the current product interface.

The technical content of this guide has been updated and reflects the state of the product at the G06.23 RVU:

Previous versions of the guide used the Object Relational Data Mining (ORDM) approach and architecture. ORDM advocates performing data mining and other parts of the knowledge discovery process against data in the SQL/MX database. This technique has been updated: readers are encouraged to perform the data preparation steps in SQL/MX but reserve the mining or model building for UNIX or Microsoft Windows platforms.

All sections of the manual have been updated to reflect the impact of major changes in SQL/MX Release 2.0 (for example, the introduction of SQL/MX tables).

Introductions to the data preparation steps have been revised and rewritten.

The DDL statements in Appendixes A, B, and C have been updated to use SQL/MX DDL syntax. Appendix A syntax has been removed; readers can consult the SQL/MX Reference Manual for the most current syntax and examples.

Index entries have been added, updated, and corrected.

About This Manual

This manual presents a nine-step knowledge discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Audience

This manual is intended for database administrators and application programmers who are using NonStop SQL/MX to solve data mining problems, either through the SQL conversational interface or through embedded SQL programs.

Organization

The sections listed in Table i describe the knowledge discovery process (or the data mining process) and present examples that carry out the process. The appendixes listed in Table i provide the syntax for the data mining features of NonStop SQL/MX and the SQL scripts that create the data mining database used in the examples.

Table i. Manual Organization

Section 1, Introduction
  Presents an overview of the knowledge discovery process and the SQL/MX approach to this process. Defines the example business opportunity used in this manual.

Section 2, Preparing the Data
  Describes the data preparation steps of the knowledge discovery process.

Section 3, Creating the Data Mining View
  Describes how to create the mining view.

Section 4, Mining the Data
  Describes the data mining steps of the knowledge discovery process.

Appendix A, Creating the Data Mining Database
  Contains DDL statement scripts that you can use to create the data mining database used in the examples in this manual.

Appendix B, Inserting Into the Data Mining Database
  Contains INSERT statement scripts that you can use to populate the data mining database used in this manual.

Appendix C, Importing Into the Data Mining Database
  Contains IMPORT statement scripts that you can use to create the data mining database used in this manual.

Related Documentation

This manual is part of the SQL/MX library of manuals, which includes:

Introductory Guides

SQL/MX Comparison Guide for SQL/MP Users
  Describes SQL differences between SQL/MP and SQL/MX.

SQL/MX Quick Start
  Describes basic techniques for using SQL in the SQL/MX conversational interface (MXCI). Includes information about installing the sample database.

Reference Manuals

SQL/MX Reference Manual
  Describes the syntax of SQL/MX statements, MXCI commands, functions, and other SQL/MX language elements.

SQL/MX Connectivity Service Command Reference
  Describes the SQL/MX administrative command library (MACL) available with the SQL/MX conversational interface (MXCI).

DataLoader/MX Reference Manual
  Describes the features and functions of the DataLoader/MX product, a tool to load SQL/MX databases.

SQL/MX Messages Manual
  Describes SQL/MX messages.

SQL/MX Glossary
  Defines SQL/MX terminology.

Programming Manuals

SQL/MX Programming Manual for C and COBOL
  Describes how to embed SQL/MX statements in ANSI C and COBOL programs.

SQL/MX Programming Manual for Java
  Describes how to embed SQL/MX statements in Java programs according to the SQLJ standard.

SQL/MX Guide to Stored Procedures in Java
  Describes how to use stored procedures that are written in Java within SQL/MX.

Specialized Guides

SQL/MX Installation and Management Guide
  Describes how to plan, install, create, and manage an SQL/MX database. Explains how to use installation and management commands and utilities.

SQL/MX Query Guide
  Describes how to understand query execution plans and write optimal queries for an SQL/MX database.

SQL/MX Data Mining Guide
  Describes the SQL/MX data structures and operations to carry out the knowledge-discovery process.

SQL/MX Queuing and Publish/Subscribe Services
  Describes how SQL/MX integrates transactional queuing and publish/subscribe services into its database infrastructure.

SQL/MX Report Writer Guide
  Describes how to produce formatted reports using data from a NonStop SQL/MX database.

SQL/MX Connectivity Service Manual
  Describes how to install and manage the SQL/MX Connectivity Service (MXCS), which enables applications developed for the Microsoft Open Database Connectivity (ODBC) application programming interface (API) and other connectivity APIs to use SQL/MX.

Online Help

The SQL/MX Online Help consists of:

Reference Help
  Overview and reference entries from the SQL/MX Reference Manual.

Messages Help
  Individual messages grouped by source from the SQL/MX Messages Manual.

Glossary Help
  Terms and definitions from the SQL/MX Glossary.

NSM/web Help
  Context-sensitive help topics that describe how to use the NSM/web management tool.

Related SQL/MP Manuals

The following manuals are part of the SQL/MP library of manuals and are essential references for information about SQL/MP Data Definition Language (DDL) and SQL/MP installation and management:

SQL/MP Reference Manual
  Describes the SQL/MP language elements, expressions, predicates, functions, and statements.

SQL/MP Installation and Management Guide
  Describes how to plan, install, create, and manage an SQL/MP database. Describes installation and management commands and SQL/MP catalogs and files.

This figure shows the manuals in the SQL/MX library:

[Figure VST001.vsd: the SQL/MX library of manuals, grouped into Introductory Guides, Reference Manuals, Programming Manuals, Specialized Guides, and the SQL/MX Online Help]

Notation Conventions

Hypertext Links

Blue underline is used to indicate a hypertext link within text. By clicking a passage of text with a blue underline, you are taken to the location described. For example:

This requirement is described under Backup DAM Volumes and Physical Disk Drives on page 3-2.

General Syntax Notation

This list summarizes the notation conventions for syntax presentation in this manual.

UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. For example:

  MAXATTACH

lowercase italic letters. Lowercase italic letters indicate variable items that you supply. Items not enclosed in brackets are required. For example:

  file-name

computer type. Computer type letters within text indicate C and Open System Services (OSS) keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. For example:

  myfile.c

italic computer type. Italic computer type letters within text indicate C and Open System Services (OSS) variable items that you supply. Items not enclosed in brackets are required. For example:

  pathname

[ ] Brackets. Brackets enclose optional syntax items. For example:

  TERM [\system-name.]$terminal-name
  INT[ERRUPTS]

A group of items enclosed in brackets is a list from which you can choose one item or none. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:

  FC [ num  ]
     [ -num ]
     [ text ]

  K [ X | D ] address

{ } Braces. A group of items enclosed in braces is a list from which you are required to choose one item. The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:

  LISTOPENS PROCESS { $appl-mgr-name }
                    { $process-name  }

  ALLOWSU { ON | OFF }

| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces. For example:

  INSPECT { OFF | ON | SAVEABEND }

... Ellipsis. An ellipsis immediately following a pair of brackets or braces indicates that you can repeat the enclosed sequence of syntax items any number of times. For example:

  M address [ , new-value ]...
  [ - ] {0|1|2|3|4|5|6|7|8|9}...

An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times. For example:

  "s-char..."

Punctuation. Parentheses, commas, semicolons, and other symbols not previously described must be typed as shown. For example:

  error := NEXTFILENAME ( file-name ) ;
  LISTOPENS SU $process-name.#su-name

Quotation marks around a symbol such as a bracket or brace indicate the symbol is a required character that you must type as shown. For example:

  "[" repetition-constant-list "]"

Item Spacing. Spaces shown between items are required unless one of the items is a punctuation symbol such as a parenthesis or a comma. For example:

  CALL STEPMOM ( process-id ) ;

If there is no space between two items, spaces are not permitted. In this example, no spaces are permitted between the period and any other items:

  $process-name.#su-name

Line Spacing. If the syntax of a command is too long to fit on a single line, each continuation line is indented three spaces and is separated from the preceding line by a blank line. This spacing distinguishes items in a continuation line from items in a vertical list of selections. For example:

  ALTER [ / OUT file-spec / ] LINE

     [ , attribute-spec ]

!i and !o. In procedure calls, the !i notation follows an input parameter (one that passes data to the called procedure); the !o notation follows an output parameter (one that returns data to the calling program). For example:

  CALL CHECKRESIZESEGMENT ( segment-id    !i
                          , error ) ;     !o

!i,o. In procedure calls, the !i,o notation follows an input/output parameter (one that both passes data to the called procedure and returns data to the calling program). For example:

  error := COMPRESSEDIT ( filenum ) ;     !i,o

!i:i. In procedure calls, the !i:i notation follows an input string parameter that has a corresponding parameter specifying the length of the string in bytes. For example:

  error := FILENAME_COMPARE_ ( filename1:length       !i:i
                             , filename2:length ) ;   !i:i

!o:i. In procedure calls, the !o:i notation follows an output buffer parameter that has a corresponding input parameter specifying the maximum length of the output buffer in bytes. For example:

  error := FILE_GETINFO_ ( filenum                    !i
                         , [ filename:maxlen ] ) ;    !o:i

Notation for Messages

This list summarizes the notation conventions for the presentation of displayed messages in this manual.

Bold Text. Bold text in an example indicates user input typed at the terminal. For example:

  ENTER RUN CODE
  ?123
  CODE RECEIVED:  123.00

The user must press the Return key after typing the input.

Nonitalic text. Nonitalic letters, numbers, and punctuation indicate text that is displayed or returned exactly as shown. For example:

  Backup Up.

lowercase italic letters. Lowercase italic letters indicate variable items whose values are displayed or returned. For example:

  p-register
  process-name

[ ] Brackets. Brackets enclose items that are sometimes, but not always, displayed. For example:

  Event number = number [ Subject = first-subject-value ]

A group of items enclosed in brackets is a list of all possible items that can be displayed, of which one or none might actually be displayed. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:

  proc-name trapped [ in SQL | in SQL file system ]

{ } Braces. A group of items enclosed in braces is a list of all possible items that can be displayed, of which one is actually displayed. The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:

  obj-type obj-name state changed to state, caused by
  { Object | Operator | Service }

  process-name State changed from old-objstate to objstate
  { Operator Request. }
  { Unknown.          }

| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces. For example:

  Transfer status: { OK | Failed }

% Percent Sign. A percent sign precedes a number that is not in decimal notation. The % notation precedes an octal number. The %B notation precedes a binary number. The %H notation precedes a hexadecimal number. For example:

  %005400
  %B101111
  %H2F
  P=%p-register E=%e-register

Notation for Management Programming Interfaces

This list summarizes the notation conventions used in the boxed descriptions of programmatic commands, event messages, and error lists in this manual.

UPPERCASE LETTERS. Uppercase letters indicate names from definition files. Type these names exactly as shown. For example:

  ZCOM-TKN-SUBJ-SERV

lowercase letters. Words in lowercase letters are words that are part of the notation, including Data Definition Language (DDL) keywords. For example:

  token-type

!r. The !r notation following a token or field name indicates that the token or field is required. For example:

  ZCOM-TKN-OBJNAME token-type ZSPI-TYP-STRING.    !r

!o. The !o notation following a token or field name indicates that the token or field is optional. For example:

  ZSPI-TKN-MANAGER token-type ZSPI-TYP-FNAME32.   !o


1. Introduction

Knowledge discovery is an iterative process involving many query-intensive steps. The challenges of data management in supporting this process efficiently are significant and continue to grow as knowledge discovery becomes more widely used.

Data mining identifies and characterizes interrelationships among multiple variables without requiring a data analyst to formulate specific questions. Software tools look for trends and patterns and flag unusual or potentially interesting ones. Because data mining reveals previously unknown information and patterns, rather than proving or disproving a hypothesis, mining enables knowledge discovery rather than just knowledge verification.

This section discusses these approaches to data mining:

The Traditional Approach
  Today, most data mining is performed outside the database, by client tools working on data extracts. This approach is limited because important information might be omitted from the data extract.

The SQL/MX Approach
  The SQL/MX approach to knowledge discovery enables you to perform many data-intensive tasks in the database itself, rather than on extracts. Examples include statistical sampling, statistical functions, temporal reasoning through sequence functions, cross-table generation, database profiling, and moving-window aggregations.

The Knowledge Discovery Process
  In the SQL/MX approach, fundamental data structures and operations are built into the database management system (DBMS) to support a wide range of knowledge discovery tasks and algorithms. The knowledge discovery process is described as a series of steps that starts with the selection and definition of a business opportunity, continues through data preparation and modeling, and ends with the deployment of the new knowledge.
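As an illustration of performing data-intensive work in the database rather than on an extract, the following query draws a random sample directly in the DBMS. This is a sketch only: the table and column names come from the example database used later in this manual, and the exact SAMPLE syntax should be checked against the SAMPLE clause entry in the SQL/MX Reference Manual.

```sql
-- Sketch: profile a 10 percent random sample of the Customers table
-- inside the DBMS, avoiding a client-side flat-file extract.
-- Column names and the exact SAMPLE syntax are illustrative; see the
-- SAMPLE clause entry in the SQL/MX Reference Manual.
SELECT Marital_Status,
       COUNT(*)    AS Num_Customers,
       AVG(Income) AS Avg_Income
FROM Customers
SAMPLE RANDOM 10 PERCENT
GROUP BY Marital_Status;
```

Because the sample is computed in the SQL engine, the full table never leaves the database, and the same statement scales to tables far larger than client memory.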
The Traditional Approach

Today's traditional knowledge discovery systems consist of an application program on top of a data source. The main emphasis in these systems is data mining: inventing new techniques and algorithms, proving their statistical soundness, and validating their effectiveness given a suitable problem. Data must be available in a convenient form, typically a flat file extracted from an appropriate data source. The knowledge discovery system consists of specific algorithms that load the entire data set into memory and perform the necessary computations.

The extract approach has two major limitations:

It does not scale to large data sets, because the entire data set is required to fit in memory. Statistical sampling can be used to avoid this limitation. However, sampling is inappropriate in many situations because it might cause patterns to be missed, such as those in small groups or those between records.

It cannot conveniently manage multiple versions of data across numerous iterations of a typical knowledge discovery investigation. For example, each iteration might require extracting additional data, performing incremental updates, deriving new attributes, and so on.

The SQL/MX Approach

In most enterprise organizations today, database systems are crucial for conducting business. DBMS systems serve as the transaction processing systems for daily operations and manage data warehouses containing huge amounts of historical information. The validated data in these warehouses is already being used for online analysis and is a natural starting point for knowledge discovery.

The SQL/MX approach identifies fundamental data structures and operations that are common across a wide range of knowledge discovery tasks and builds those structures and operations into the DBMS. The primary advantages of the SQL/MX technology over traditional data mining techniques include:

The ability to mine much larger data sets, not only data in flat-file extracts
Simplified data management
More complete results
Better performance and reduced cycle times

The main features of the SQL/MX approach are summarized next.

Data-Intensive Computations Performed in the DBMS

Tools and applications perform data-intensive data-preparation tasks in the DBMS by using an SQL interface. As a result, you can access the powerful and parallel DBMS data manipulation capabilities in the data preparation stage of the knowledge discovery process.

Use of Built-In DBMS Data Structures and Operations

Fundamental data structures and operations are built into the DBMS to support a wide range of knowledge discovery tasks and algorithms in an efficient and scalable manner.
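One built-in operation of this kind is moving-window aggregation through sequence functions. The following sketch computes a three-month moving average of account balances in the DBMS; the table and column names are assumptions based on the example database in this manual, and the exact sequence-function syntax should be checked against the SQL/MX Reference Manual.

```sql
-- Sketch: a three-month moving average of each account's balance,
-- computed in the DBMS with a sequence function instead of client code.
-- Table and column names follow this manual's example database; the
-- exact SEQUENCE BY and MOVINGAVG syntax is in the SQL/MX Reference
-- Manual.
SELECT Account, Month, Balance,
       MOVINGAVG(Balance, 3) AS Balance_3mo_Avg
FROM Account_History
SEQUENCE BY Account, Month;
```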

Building these data structures and operations into the DBMS allows mining tasks to be moved into the SQL engine for tighter integration of data and mining operations and for improved performance and scalability. Adding new primitives, such as moving-window aggregate functions, simplifies the queries needed by knowledge discovery tools and applications. This type of query simplification often results in significant improvements in performance.

The Knowledge Discovery Process

The knowledge discovery process is a nine-step process that starts with the selection and definition of a business opportunity, continues through several data preparation steps and a modeling step, and ends with the deployment of the new knowledge. This subsection briefly describes each step of that process.

1. Identify and define a business opportunity. The process begins with the identification and precise specification of a business opportunity. See Defining the Business Opportunity.

2. Preprocess and load the data for the business opportunity. Real-world data is often inconsistent and incomplete. The first preparation step is to address these problems by preprocessing the data in various ways, for example, verifying and mapping the data. Then load the data into your database system. See Preparing the Data.

3. Profile and understand the relevant data. Generate a variety of statistics such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on. See Profiling the Data.

4. Define events relevant to the business opportunity being explored. Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events.

5. Derive attributes. For example, customer age can be derived from birth date. Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information. See Preparing the Data.

6. Create the data mining view. Transform the data into a mining view, a form in which all attributes about the primary mining entity occur in a single record. See Creating the Mining View.

7. Mine the data and build models. Core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data. See Mining the Data.

8. Deploy models. Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or deployment might be embedding the model in an operational system to achieve predictive results.

9. Monitor model performance. Performance of the model must be monitored for accuracy. When accuracy begins to decline, the model must be updated to fit the current situation. See Knowledge Deployment and Monitoring.

In Step 1, a business opportunity is identified and defined. In Steps 2 through 6, data mining data is gathered, preprocessed, and organized in a form that is suitable for mining. These steps require the most time in the process. For example, selecting the data is an important step in the process and typically requires the assistance of a data mining expert or subject matter expert who has knowledge of the data to be mined. In Step 7, models are built. In Steps 8 and 9, the models are deployed and monitored. This latter part of the knowledge discovery process focuses on analyzing the data mining view prepared in Steps 2 through 6.

Defining the Business Opportunity

The process begins with the identification and precise specification of a business opportunity.
Several factors must be considered when evaluating potential opportunities:

Quantification of the return on investment
  What is the answer worth? How much money can be saved? How much of a competitive advantage does it offer?

Usability of the results
  Merely identifying patterns is not enough. The opportunity and analysis must be structured so that any interpretation of results obtained develops into deployable business strategy.

Political and organizational reaction
  In assessing probabilities for organizational resistance, it is helpful to examine similar past efforts and understand why those efforts succeeded or failed.

Availability of business analysts and data mining experts and technology
  Are data, domain, and mining experts available to participate in the process? Is sufficient technology, both hardware and software, available?

Data availability
  Does preclassified data exist, or can it be derived? Do sufficiently large amounts of data exist? Both internal and external data sources should be considered.

Logistics
  How difficult is it to collect, extract, and transport the relevant data? Is confidentiality an issue?

Careful consideration of these factors helps to ensure that the opportunity selected is both amenable to data mining and likely to provide significant value.

After an opportunity is selected, the next task is to specify it precisely. In the scenario of building a model to predict credit card account attrition, the goal is to build a model that will predict, as early as possible, whether a credit card customer will close their account. To specify this opportunity precisely, decide on an explicit definition of attrition, such as when a customer calls and closes their account. Another option is implicit attrition: when a customer stops using their card. For simplicity, define attrition as a customer closing their account or maintaining a zero balance for three months.

Another aspect of specifying the opportunity is defining what it means to predict as early as possible when an account will be closed. For this example, choose three months as the prediction window.
This window should be long enough to allow the card issuer to take some action to try to retain customers likely to leave, but short enough to capture attrition-related patterns. The goal is to build a model that will predict, as early as possible, customer attrition.

Example Business Opportunity

The precise specification of our example opportunity is to build a model that will predict at any point in time, based on such things as current account status, account activity, and demographics, whether a credit card customer will close their account in the future. Note that the precise specification of the opportunity might be modified or refined later in the knowledge discovery process as more information becomes available.

This manual uses this opportunity scenario to describe the knowledge discovery process and how to implement it. The data set used to illustrate techniques and SQL/MX features consists of two tables: one containing customer information and the other containing account history information. This data set is presented in Appendix A through C of this manual. A subset of this data set is shown in these tables:

Customers Table

Account   Name           Marital Status   Home   Income
...       Jones, Mary    Single           Own    65,...
...       Abbas, Ali     Divorced         Rent   32,...
...       Kano, Tomoko   Divorced         Own    44,...
...       Lund, Erika    Widow            Own    28,000

Account History Table

Account   Month   Status   Limit   Balance   Payment   Fin. Chrg
...       .../03  Open     10,...  ...       ...       ...
...       .../02  Open      5,...  ...       ...       ...
...       .../00  Open      6,...  ...       ...       ...
...       .../03  Open     10,...  ...       ...       ...
...       .../02  Open      5,...  ...       ...       ...

The first table, the Customers table, contains one row for each credit card account and consists of customer demographic information such as marital status, income, and so on. For a large financial institution, a customers table such as this one might contain approximately 10 million rows and 100 columns. The second table, the Account History table, contains monthly status records, one for each account for each month the account was open over a given time period, and consists of about 200 columns. For this example, suppose the time period is three years. The history table would then contain about 360 million rows, assuming 10 million customers. Given these parameters, the size of the first table is about 5 GB (10 million rows, 500 bytes in each row), and the size of the second table is about 360 GB (360 million rows, 1000 bytes in each row).

For the example business opportunity, the Status and Balance fields of the Account History table are used to determine whether a customer will close their account. If the Status changes from Open to Closed, or if the Balance is zero for three consecutive months, then the customer is defined as having left, that is, no longer holds a credit card account.

Preparing the Data

After a business opportunity has been identified and defined, the next task is to prepare a data set for mining. This is done in Steps 2 through 6 of the knowledge discovery process. See The Knowledge Discovery Process on page 1-3. The first two steps are preprocessing the mining data to make it consistent and then loading the data into a database system. For further information, see Loading the Data on page 2-2.

The next step is to generate a variety of statistics, for example, column unique entry counts, value ranges, number of missing values, mean, variance, and so on. This type of data profile is helpful in gaining an understanding of the data, and the profile also serves as a valuable reference throughout the knowledge discovery process.

Profiling the Data

A profile of the database helps to solve the data mining problem in these ways:

To better understand the data
To decide which columns to use for analysis
To decide whether to treat attributes as discrete or continuous

Types of Information

The type of information used to create a profile of the data mining view comes from the following elements:

Tables in the database
Table attributes (or columns to be used in the analysis)
Data types of the table attributes
Relationships between tables
Cardinalities of discrete attributes
Statistics about continuous attributes
Derived table attributes (or derived columns to be used in the analysis)

Determining the derived columns to be constructed requires knowledge of the table attributes and how these attributes relate to the data mining problem. See Preparing the Data on page 1-7 for a full discussion of these elements.

SQL/MX provides the TRANSPOSE clause of the SELECT statement to display the cardinalities of discrete attributes.
See Transposition on page 2-3 and the TRANSPOSE Clause entry in the SQL/MX Reference Manual for details.

Example of Finding Cardinality of Discrete Attributes

The customers table in your data set has Age and Number_Children columns. Both of these attributes are discrete, and you can compute the cardinality of each attribute.

You obtain the cardinality of an attribute, which is the count of the number of unique values for the attribute, by using a COUNT DISTINCT query. For example:

SELECT COUNT(DISTINCT Age) FROM Customers;

or

SELECT COUNT(DISTINCT Number_Children) FROM Customers;

Instead of having to submit a query for each attribute, you can obtain counts for multiple attributes of a table by using the TRANSPOSE clause. For example:

SET NAMETYPE ANSI;
SET SCHEMA dmcat.whse;

SELECT ColumnIndex, COUNT(DISTINCT ColumnValue)
FROM Customers
  TRANSPOSE Age, Number_Children AS ColumnValue
  KEY BY ColumnIndex
GROUP BY ColumnIndex;

COLUMNINDEX  (EXPR)
-----------  ------
          1     ...
          2     ...

--- 2 row(s) selected.

The first row of the result table of the TRANSPOSE clause contains the distinct count for the column Age, and the second row contains the distinct count for the column Number_Children. You can treat the Age values as categories, consisting of age ranges. Similarly, values of Number_Children greater than five can be placed in the category for Number_Children equal to five. The number of attributes in a TRANSPOSE clause is unlimited.

Note. The data types of attributes to be transformed into a single column must be compatible. The data type of the result column is the union-compatible data type of the attributes. For further information, see Profiling the Data on page 2-2.

Defining Events

In the scenario considered in this manual, the relevant event is the account holder leaving. This event occurs at different points in time for customers that leave and not at all for customers that stay. This event must be defined so that account status and activity in the months leading up to a customer leaving can be located and aligned in columns. For example, suppose you create three derived attributes that describe the account balance for each of the

three months before a customer leaves, because these attributes are predictors of attrition. For the customers that do leave, the months leading up to leaving occur at various points in time. For customers that do not leave, these months are chosen to be any three consecutive months in which the account is open. The information about these months should be aligned for all accounts in a single set of columns, one for each of the three months. Most mining algorithms require a single logical attribute, such as the balance one month before leaving, to be stored in one column in all records, rather than in different columns in different records.

For example, consider this data in a table that contains monthly account balances for each month in the three-year history period:

  Account  ...  Bal 08/03  Bal 09/03  Bal 10/03  Bal 11/03  ...  Left
  ...      ...  ...        ...        ...        ...        ...  Yes (closed)
  ...      ...  ...        ...        ...        ...        ...  Yes (0 bal)

  Account  ...  Bal 07/02  Bal 08/02  Bal 09/02  Bal 10/02  ...  Left
  ...      ...  ...        ...        ...        ...        ...  Yes (closed)

The balances prior to the event (of the customer leaving) are in different date columns for these accounts, and therefore algorithms that build predictive models are not able to consider this information. A table organization that allows this information to be considered:

  Account  ...  Bal-3  Bal-2  Bal-1  Date Left  ...  Left
  ...      ...  ...    ...    ...    .../03     ...  Yes (closed)
  ...      ...  ...    ...    ...    .../03     ...  Yes (0 bal)
  ...      ...  ...    ...    ...    .../03     ...  Yes (closed)

In this table, columns Bal-1 through Bal-3 contain account balances one through three months prior to a customer leaving. Consequently, this information is aligned within a single set of columns and can be considered during model creation. For further information, see Defining Events on page 2-6.

Deriving Attributes

The next task is to derive attributes that are not relative to events. For example, customer age can be derived from birth date. Part of the challenge of effective data mining is identifying a set of derived attributes that capture key indicators relevant to the business opportunity being explored.
For further information, see Deriving Attributes on page 2-9.
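The age derivation mentioned above is simple enough to sketch outside SQL. A minimal Python illustration, with hypothetical birth dates and a fixed reference date for reproducibility (none of these values come from the manual's data set):

```python
from datetime import date

def derive_age(birth_date, as_of):
    """Age in whole years on the as_of date, as a derived attribute."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

# Hypothetical customer birth dates, evaluated as of a fixed date.
as_of = date(2004, 4, 1)
print(derive_age(date(1970, 3, 15), as_of))  # 34: birthday already passed
print(derive_age(date(1970, 6, 15), as_of))  # 33: birthday still ahead
```

The same subtraction-with-birthday-correction logic can be expressed as a derived column in SQL with date functions.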

Creating the Mining View

The final data preparation step is to transform the data set into a mining view, a form in which all attributes about the main mining entity appear in a single record. The mining entity used in this manual is a credit card account. The data mining challenge is to determine predictors for when a customer will close a credit card account.

Transforming the data set to a single record for each mining entity often involves a pivot operation, in which attributes in multiple rows are collapsed and put into a single row. For example, in the credit card example, the set of history records associated with each account is collapsed to a single record and then appended to the corresponding customer record. For further information, see Section 3, Creating the Data Mining View. The resulting table looks similar to this:

Mining View

  Account  Mar Status  Income  Bal-3  Bal-2  Bal-1  Date Left  Left
  ...      Single      65,000  ...    ...    ...    .../99     Yes
  ...      Divorced    32,000  ...    ...    ...    .../99     Yes
  ...      Divorced    44,000  ...    ...    ...    .../98     Yes
  ...      Married     32,000  ...    ...    ...    ...        No

This table contains demographic information from the Customers table, such as marital status and income, and also pivoted columns from the Account History table, such as balances prior to leaving. You use this example data set in the data mining step, the next step in the knowledge discovery process.

Mining the Data

In the data mining step, core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks performed in this step are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data. The task used in this customer scenario is predictive: to build a model to predict attrition of credit card customers based on historical information, such as demographics and account activity.
The most common predictive tasks are:

  Classification: Classify a case (or record) into one of several predefined classes.
  Regression: Map a case (or record) into a numerical prediction value.

Descriptive tasks involve finding patterns describing the data. The most common are:

  Database segmentation (clustering): Map a case into one of several clusters.
  Summarization: Provide a compact description of the data, often in visual form.
  Link analysis: Determine relationships between attributes in a case.
  Sequence analysis: Determine trends over time.

You use a variety of algorithms, and the models they produce, to perform these predictive and descriptive tasks. For example, classification can be done by building a decision tree model, where each branch of the tree is represented by a predicate involving attributes in the mining data set and where each branch is homogeneous with respect to whether the predicate is true or false. The main task in classification is to determine which predicates form the decision tree that predicts the goal. The most common algorithms for classification come from the field of machine learning in computer science.

Typically, the model building step involves the use of client mining tools that require the interactive participation of the user to guide the investigation. A description of these special-purpose tools is beyond the scope of this manual. For further information, see Section 4, Mining the Data.

Knowledge Deployment and Monitoring

The last two steps of the knowledge discovery process involve deploying and monitoring discovered knowledge. Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or it might mean embedding the model in an operational system to achieve predictive results. Most data mining tools support model deployment either by applying a model to data within the tool or by exporting a model as executable code, which can then be embedded and used in applications.
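When a model is exported as executable code, the deployed artifact is often just a scoring function applied in batch. A purely illustrative Python sketch; the tree, its thresholds, and the attribute names are invented for illustration and are not the decision tree built in this manual's example:

```python
# Hypothetical exported decision tree: predicts whether an account
# holder is likely to leave, from two mining-view attributes.
def likely_to_leave(bal_1, marital_status):
    """Toy scoring function standing in for an exported model."""
    if bal_1 == 0:
        return True   # zero balance last month: high attrition risk
    if marital_status == "Divorced" and bal_1 < 100:
        return True
    return False

# Periodic batch scoring over the mining view, as deployment might do.
accounts = [("A1", 0, "Single"), ("A2", 650, "Married"), ("A3", 50, "Divorced")]
at_risk = [acct for acct, bal_1, ms in accounts if likely_to_leave(bal_1, ms)]
print(at_risk)  # ['A1', 'A3']
```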
In the credit card attrition example, one form of model deployment is to periodically use the model to identify profitable customers that are likely to leave, and then to take some action, such as lowering interest rates or waiving fees, to try to retain these customers.


2 Preparing the Data

Section 1, Introduction, identifies and defines a business opportunity, the first step in the knowledge discovery process supported by SQL/MX. This section describes Steps 2 through 6:

1. Identify and define a business opportunity.

2. Preprocess and load the data for the business opportunity. The first preparation step is to address these problems by preprocessing the data in various ways, for example, verifying and mapping the data. Then load the data into your database system. See Loading the Data on page 2-2.

3. Profile and understand the relevant data. Generate a variety of statistics, such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on. See Profiling the Data on page 2-2.

4. Define events relevant to the business opportunity being explored. Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events on page 2-6.

5. Derive attributes. For example, customer age can be derived from birth date. Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information. See Deriving Attributes on page 2-9.

6. Create the data mining view.

7. Mine the data and build models.

8. Deploy models.

9. Monitor model performance.
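The pivot performed in Step 6 is commonly expressed as a grouped conditional aggregation. A sketch in standard SQL via Python's sqlite3, with a hypothetical history table whose rows are already keyed by months before the leave event (SQL/MX's own pivot mechanism is described in Section 3; this is only an illustration of the idea):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE acct_history (account TEXT, months_before_event INTEGER,
                               balance INTEGER);
    INSERT INTO acct_history VALUES
        ('A1', 3, 500), ('A1', 2, 420), ('A1', 1, 300);
""")

# One output column per relative month: the CASE keeps only the matching
# row's balance, and MAX collapses the group to a single row per account.
row = con.execute("""
    SELECT account,
           MAX(CASE WHEN months_before_event = 3 THEN balance END) AS bal_3,
           MAX(CASE WHEN months_before_event = 2 THEN balance END) AS bal_2,
           MAX(CASE WHEN months_before_event = 1 THEN balance END) AS bal_1
    FROM acct_history
    GROUP BY account
""").fetchone()
print(row)  # ('A1', 500, 420, 300)
```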

Loading the Data

The first step in preparing a data set for mining is loading the data into database tables. Suppose the credit card organization has a customers data warehouse. The customer data and the account history data are stored in this warehouse. In a typical real-world scenario, the warehouse could have millions of records representing millions of customers dating back many years.

Creating the Database

Suppose a data mining database is created consisting of the Customers table and the Account History table described in the previous section. You can use the DDL scripts included with this manual to create a database to run the examples in this manual. To create the database:

1. Open the .pdf file for this manual.
2. Navigate to Appendix A, Creating the Data Mining Database, which contains the DDL script that creates the database.
3. On the tool bar, select the Table/Formatted Text Select Tool.
4. Copy and paste from the DDL script, one page at a time, into an OSS text file.
5. Within MXCI (the SQL/MX conversational interface), obey the OSS file you have created.

Importing Data Into the Database

After the data mining database is created, the warehouse data is imported into the database. In a typical real-world scenario, you would import the data by using some type of database utility; for example, you can use the DataLoader/MP utility to import a large quantity of data into an SQL/MP database. For further information, see the DataLoader/MX Reference Manual and the SQL/MX Reference Manual for discussions of the Import Utility.

Alternatively, you can use INSERT statements to insert values into the data mining database. The INSERT statements for the example in this manual are included in Appendix B, Inserting Into the Data Mining Database.

Profiling the Data

Profiling often begins with the computation of basic information about each attribute.
For discrete attributes, this basic information is typically a table of the unique values and a count of how many times each value occurs. However, as cardinality increases, these frequencies become less and less meaningful. For continuous attributes, the approach is to use metrics such as minimum, maximum, mean, and variance.
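The continuous-attribute metrics named here can be sanity-checked off-line with a small script. A sketch over a hypothetical sample of balances using Python's statistics module (note that a SQL engine's VARIANCE function may use the sample formula, dividing by n-1, rather than the population formula shown here):

```python
import statistics

# Hypothetical balance sample; not from the manual's data set.
balances = [120, 0, 450, 300, 80]

profile = {
    "min": min(balances),
    "max": max(balances),
    "mean": statistics.mean(balances),
    # Population variance; some engines' VARIANCE divides by n-1 instead.
    "variance": statistics.pvariance(balances),
}
print(profile)
```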

Cardinalities and Metrics

For any attribute, one approach to profiling is to run a separate query for each attribute. As an example, consider the following queries, which profile the discrete attribute Marital Status from the Customers table and the continuous attribute Balance from the Account History table.

Example of Discrete Attribute

This query counts the occurrences of each distinct value of the Marital Status column of the Customers table:

SELECT marital_status, COUNT(*)
FROM customers
GROUP BY marital_status;

Example of Continuous Attribute

This query computes statistical information about the continuous attribute Balance in the Account History table:

SELECT MIN(balance), MAX(balance), AVG(balance), VARIANCE(balance)
FROM acct_history;

Transposition

Other than the computation of a few metrics, both of the previous queries require a complete scan of the data. In this way, a table with N attributes requires N queries, resulting in the same number of complete scans. For a wide mining table, this procedure can result in thousands of queries and scans of the data.

Using transposition, SQL/MX can perform the above profiling operations with a total of only two queries, regardless of the number of attributes to be profiled. Through the TRANSPOSE clause of the SELECT statement, different columns of a source table can be treated as a single output column, enabling similar computations to be performed on all such source columns. TRANSPOSE takes each row in the source table and converts each expression listed in the transpose set to an individual output row. Used in this way, TRANSPOSE can compute frequency counts for all discrete attributes in a table in a single query. See the TRANSPOSE Clause entry in the SQL/MX Reference Manual for more information.
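Outside SQL/MX, a TRANSPOSE-style single-query frequency profile can be approximated with a UNION ALL that stacks the attributes into one column. A sketch using Python's sqlite3 with a hypothetical two-attribute customers table (the real TRANSPOSE clause avoids the multiple scans this emulation implies):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (gender TEXT, home TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [("F", "Own"), ("M", "Own"), ("F", "Rent"), ("M", "Own")])

# Tag each source column with its name, stack the values into one column,
# then a single GROUP BY yields frequency counts for every attribute at once.
rows = con.execute("""
    SELECT attr, c1, COUNT(*)
    FROM (SELECT 'GENDER' AS attr, gender AS c1 FROM customers
          UNION ALL
          SELECT 'HOME'   AS attr, home   AS c1 FROM customers)
    GROUP BY attr, c1
    ORDER BY attr, c1
""").fetchall()
print(rows)  # [('GENDER', 'F', 2), ('GENDER', 'M', 2), ('HOME', 'Own', 3), ('HOME', 'Rent', 1)]
```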
Example of Computing Counts for Character Discrete Attributes

This query computes the frequency counts for the discrete attributes Gender, Marital Status, and Home, which are all type character:

SET NAMETYPE ANSI;
SET SCHEMA mining.whse;

SELECT attr, c1, COUNT(*)
FROM customers
  TRANSPOSE ('GENDER', gender),
            ('HOME', home),
            ('MARITAL_STATUS', marital_status)
    AS (attr, c1)
GROUP BY attr, c1
ORDER BY attr, c1;

ATTR            C1        (EXPR)
--------------  --------  ------
GENDER          F             20
GENDER          M             22
HOME            Own           33
HOME            Rent           9
MARITAL_STATUS  Divorced      12
MARITAL_STATUS  Married        9
MARITAL_STATUS  Single        15
MARITAL_STATUS  Widow          6

--- 8 row(s) selected.

Because this query produces counts for three different attributes, use the ATTR column to distinguish the attribute from which the values are drawn. The C1 column contains the values for these character attributes.

Example of Computing Counts for Character and Numeric Discrete Attributes

This query also shows the TRANSPOSE clause and illustrates how profiling can be achieved. The column C2 has been added to the statement because Number_Children has a numeric data type.

SELECT attr, c1, c2, COUNT(*)
FROM customers
  TRANSPOSE ('GENDER', gender, null),
            ('HOME', home, null),
            ('MARITAL_STATUS', marital_status, null),
            ('NUMBER_CHILDREN', null, number_children)
    AS (attr, c1, c2)
GROUP BY attr, c1, c2
ORDER BY attr, c1, c2;

ATTR             C1        C2  (EXPR)
---------------  --------  --  ------
GENDER           F          ?      20
GENDER           M          ?      22
HOME             Own        ?      33
HOME             Rent       ?       9
MARITAL_STATUS   Divorced   ?      12
MARITAL_STATUS   Married    ?       9
MARITAL_STATUS   Single     ?      15
MARITAL_STATUS   Widow      ?       6
NUMBER_CHILDREN  ?          0      25
NUMBER_CHILDREN  ?


STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

HP NonStop SQL DDL Replicator User s Guide

HP NonStop SQL DDL Replicator User s Guide HP NonStop SQL DDL Replicator User s Guide Abstract HP NonStop SQL DDL Replicator Software replicates NonStop SQL DDL operations to one or more backup systems. Product Version NonStop SQL DDL Replicator

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

MOC 20461C: Querying Microsoft SQL Server. Course Overview

MOC 20461C: Querying Microsoft SQL Server. Course Overview MOC 20461C: Querying Microsoft SQL Server Course Overview This course provides students with the knowledge and skills to query Microsoft SQL Server. Students will learn about T-SQL querying, SQL Server

More information

Oracle Database 10g: Introduction to SQL

Oracle Database 10g: Introduction to SQL Oracle University Contact Us: 1.800.529.0165 Oracle Database 10g: Introduction to SQL Duration: 5 Days What you will learn This course offers students an introduction to Oracle Database 10g database technology.

More information

Access Queries (Office 2003)

Access Queries (Office 2003) Access Queries (Office 2003) Technical Support Services Office of Information Technology, West Virginia University OIT Help Desk 293-4444 x 1 oit.wvu.edu/support/training/classmat/db/ Instructor: Kathy

More information

Kiwi Log Viewer. A Freeware Log Viewer for Windows. by SolarWinds, Inc.

Kiwi Log Viewer. A Freeware Log Viewer for Windows. by SolarWinds, Inc. Kiwi Log Viewer A Freeware Log Viewer for Windows by SolarWinds, Inc. Kiwi Log Viewer displays text based log files in a tabular format. Only a small section of the file is read from disk at a time which

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: +966 12 739 894 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training is designed to

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Chapter 10 Practical Database Design Methodology and Use of UML Diagrams

Chapter 10 Practical Database Design Methodology and Use of UML Diagrams Chapter 10 Practical Database Design Methodology and Use of UML Diagrams Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 10 Outline The Role of Information Systems in

More information

Master of Science in Healthcare Informatics and Analytics Program Overview

Master of Science in Healthcare Informatics and Analytics Program Overview Master of Science in Healthcare Informatics and Analytics Program Overview The program is a 60 credit, 100 week course of study that is designed to graduate students who: Understand and can apply the appropriate

More information

Section 1.4 Place Value Systems of Numeration in Other Bases

Section 1.4 Place Value Systems of Numeration in Other Bases Section.4 Place Value Systems of Numeration in Other Bases Other Bases The Hindu-Arabic system that is used in most of the world today is a positional value system with a base of ten. The simplest reason

More information

ETPL Extract, Transform, Predict and Load

ETPL Extract, Transform, Predict and Load ETPL Extract, Transform, Predict and Load An Oracle White Paper March 2006 ETPL Extract, Transform, Predict and Load. Executive summary... 2 Why Extract, transform, predict and load?... 4 Basic requirements

More information

Expert Oracle Exadata

Expert Oracle Exadata Expert Oracle Exadata Kerry Osborne Randy Johnson Tanel Poder Apress Contents J m About the Authors About the Technical Reviewer a Acknowledgments Introduction xvi xvii xviii xix Chapter 1: What Is Exadata?

More information

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Rajesh Reddy Muley 1, Sravani Achanta 2, Prof.S.V.Achutha Rao 3 1 pursuing M.Tech(CSE), Vikas College of Engineering and

More information

Base Conversion written by Cathy Saxton

Base Conversion written by Cathy Saxton Base Conversion written by Cathy Saxton 1. Base 10 In base 10, the digits, from right to left, specify the 1 s, 10 s, 100 s, 1000 s, etc. These are powers of 10 (10 x ): 10 0 = 1, 10 1 = 10, 10 2 = 100,

More information

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals 1 Properties of a Database 1 The Database Management System (DBMS) 2 Layers of Data Abstraction 3 Physical Data Independence 5 Logical

More information

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance 3.1 Introduction This research has been conducted at back office of a medical billing company situated in a custom

More information

Business Enterprise Server Help Desk Integration Guide. Version 3.5

Business Enterprise Server Help Desk Integration Guide. Version 3.5 Business Enterprise Server Help Desk Integration Guide Version 3.5 June 30, 2010 Copyright Copyright 2003 2010 Interlink Software Services, Ltd., as an unpublished work. All rights reserved. Interlink

More information

Expedite for Windows Software Development Kit Programming Guide

Expedite for Windows Software Development Kit Programming Guide GXS EDI Services Expedite for Windows Software Development Kit Programming Guide Version 6 Release 2 GC34-3285-02 Fifth Edition (November 2005) This edition replaces the Version 6.1 edition. Copyright

More information

EMC SourceOne Auditing and Reporting Version 7.0

EMC SourceOne Auditing and Reporting Version 7.0 EMC SourceOne Auditing and Reporting Version 7.0 Installation and Administration Guide 300-015-186 REV 01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Copyright

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Heterogeneous Replication Guide. Replication Server 15.5

Heterogeneous Replication Guide. Replication Server 15.5 Heterogeneous Replication Guide Replication Server 15.5 DOCUMENT ID: DC36924-01-1550-01 LAST REVISED: March 2010 Copyright 2010 by Sybase, Inc. All rights reserved. This publication pertains to Sybase

More information

Guidelines for using Microsoft System Center Virtual Machine Manager with HP StorageWorks Storage Mirroring

Guidelines for using Microsoft System Center Virtual Machine Manager with HP StorageWorks Storage Mirroring HP StorageWorks Guidelines for using Microsoft System Center Virtual Machine Manager with HP StorageWorks Storage Mirroring Application Note doc-number Part number: T2558-96337 First edition: June 2009

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: + 38516306373 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training delivers the

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

3. Add and delete a cover page...7 Add a cover page... 7 Delete a cover page... 7

3. Add and delete a cover page...7 Add a cover page... 7 Delete a cover page... 7 Microsoft Word: Advanced Features for Publication, Collaboration, and Instruction For your MAC (Word 2011) Presented by: Karen Gray (kagray@vt.edu) Word Help: http://mac2.microsoft.com/help/office/14/en-

More information

12 File and Database Concepts 13 File and Database Concepts A many-to-many relationship means that one record in a particular record type can be relat

12 File and Database Concepts 13 File and Database Concepts A many-to-many relationship means that one record in a particular record type can be relat 1 Databases 2 File and Database Concepts A database is a collection of information Databases are typically stored as computer files A structured file is similar to a card file or Rolodex because it uses

More information

NonStop NET/MASTER D30. MS Operator's Guide. System Software Library

NonStop NET/MASTER D30. MS Operator's Guide. System Software Library System Software Library NonStop NET/MASTER MS Operator's Guide Abstract Part Number 106379 Edition This manual describes the user interface to NonStop NET/MASTER Management Services (MS). It provides the

More information

Database Programming with PL/SQL: Learning Objectives

Database Programming with PL/SQL: Learning Objectives Database Programming with PL/SQL: Learning Objectives This course covers PL/SQL, a procedural language extension to SQL. Through an innovative project-based approach, students learn procedural logic constructs

More information

Sage Abra SQL HRMS Reports. User Guide

Sage Abra SQL HRMS Reports. User Guide Sage Abra SQL HRMS Reports User Guide 2010 Sage Software, Inc. All rights reserved. Sage, the Sage logos, and the Sage product and service names mentioned herein are registered trademarks or trademarks

More information

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access

More information

ETL Process in Data Warehouse. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

ETL Process in Data Warehouse. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT ETL Process in Data Warehouse G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT Outline ETL Extraction Transformation Loading ETL Overview Extraction Transformation Loading ETL To get data out of

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: 001-855-844-3881 & 001-800-514-06-97 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals

More information

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database

More information

itp Secure WebServer System Administrator s Guide

itp Secure WebServer System Administrator s Guide itp Secure WebServer System Administrator s Guide Abstract This guide describes how to install, configure, and manage the itp Secure WebServer. It also discusses how to develop and integrate Common Gateway

More information

Chapter 2: Elements of Java

Chapter 2: Elements of Java Chapter 2: Elements of Java Basic components of a Java program Primitive data types Arithmetic expressions Type casting. The String type (introduction) Basic I/O statements Importing packages. 1 Introduction

More information

EMC NetWorker. Licensing Guide. Release 8.0 P/N 300-013-596 REV A01

EMC NetWorker. Licensing Guide. Release 8.0 P/N 300-013-596 REV A01 EMC NetWorker Release 8.0 Licensing Guide P/N 300-013-596 REV A01 Copyright (2011-2012) EMC Corporation. All rights reserved. Published in the USA. Published June, 2012 EMC believes the information in

More information

Participant Guide RP301: Ad Hoc Business Intelligence Reporting

Participant Guide RP301: Ad Hoc Business Intelligence Reporting RP301: Ad Hoc Business Intelligence Reporting State of Kansas As of April 28, 2010 Final TABLE OF CONTENTS Course Overview... 4 Course Objectives... 4 Agenda... 4 Lesson 1: Reviewing the Data Warehouse...

More information

Introduction to Querying & Reporting with SQL Server

Introduction to Querying & Reporting with SQL Server 1800 ULEARN (853 276) www.ddls.com.au Introduction to Querying & Reporting with SQL Server Length 5 days Price $4169.00 (inc GST) Overview This five-day instructor led course provides students with the

More information

FileMaker 12. ODBC and JDBC Guide

FileMaker 12. ODBC and JDBC Guide FileMaker 12 ODBC and JDBC Guide 2004 2012 FileMaker, Inc. All Rights Reserved. FileMaker, Inc. 5201 Patrick Henry Drive Santa Clara, California 95054 FileMaker and Bento are trademarks of FileMaker, Inc.

More information

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner 24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India

More information