HP NonStop SQL/MX Data Mining Guide


Abstract

This manual presents a nine-step knowledge-discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Product Version: NonStop SQL/MX Release 2.0

Supported Release Version Updates (RVUs): This publication supports G06.23 and all subsequent G-series releases until otherwise indicated by its replacement publication.

Part Number:
Published: April 2004

Document History

Part Number    Product Version               Published
               NonStop SQL/MX Release 1.0    February
               NonStop SQL/MX Release 2.0    April 2004

Contents

What's New in This Manual iii
    Manual Information iii
    New and Changed Information iii

About This Manual v
    Audience v
    Organization v
    Related Documentation vi
    Notation Conventions viii

1. Introduction
    The Traditional Approach 1-1
    The SQL/MX Approach 1-2
        Data-Intensive Computations Performed in the DBMS 1-2
        Use of Built-In DBMS Data Structures and Operations 1-2
    The Knowledge Discovery Process 1-3
        Defining the Business Opportunity 1-4
        Preparing the Data 1-7
        Creating the Mining View 1-10
        Mining the Data 1-10
        Knowledge Deployment and Monitoring

2. Preparing the Data
    Loading the Data 2-2
        Creating the Database 2-2
        Importing Data Into the Database 2-2
    Profiling the Data 2-2
        Cardinalities and Metrics 2-3
        Transposition 2-3
        Quick Profiling 2-5
    Defining Events 2-6
    Aligning the Data 2-6
    Deriving Attributes 2-9
        Moving Metrics 2-9
        Rankings

3. Creating the Data Mining View
    Creating the Single Table 3-2
    Pivoting the Data

4. Mining the Data
    Building the Model 4-2
        Building Decision Trees 4-2
    Checking the Model 4-9
    Applying the Model to the Mining Table 4-10
    Applying the Model to the Database 4-10
    Deploying the Model 4-10
    Monitoring Model Performance 4-11

A. Creating the Data Mining Database

B. Inserting Into the Data Mining Database

C. Importing Into the Data Mining Database
    Importing Customers Data C-1
        Customers Format File C-1
        Customers Data File C-1
    Importing Account History Data C-2
        Account History Format File C-3
        Account History Data File C-3

Index

Figures
    Figure 4-1. Initial Branches of Decision Tree 4-4
    Figure 4-2. Decision Tree for Divorced Branch 4-5
    Figure 4-3. Decision Tree for Single Branch 4-6
    Figure 4-4. Final Decision Tree 4-9

Tables
    Table i. Manual Organization v

Hewlett-Packard Company

What's New in This Manual

Manual Information

Abstract

HP NonStop SQL/MX Data Mining Guide

This manual presents a nine-step knowledge-discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Product Version: NonStop SQL/MX Release 2.0

Supported Release Version Updates (RVUs): This publication supports G06.23 and all subsequent G-series releases until otherwise indicated by its replacement publication.

Part Number:
Published: April 2004

Document History

Part Number    Product Version               Published
               NonStop SQL/MX Release 1.0    February
               NonStop SQL/MX Release 2.0    April 2004

New and Changed Information

This publication has been updated to reflect new product names. Because product names are changing over time, this publication might contain both HP and Compaq product names. Product names in graphic representations are consistent with the current product interface.

The technical content of this guide has been updated and reflects the state of the product at the G06.23 RVU:

Previous versions of the guide used the Object Relational Data Mining (ORDM) approach and architecture. ORDM advocates performing data mining and other parts of the knowledge discovery process against data in the SQL/MX database. This technique has been updated: readers are encouraged to perform the data preparation steps in SQL/MX but reserve the mining or model building for UNIX or Microsoft Windows platforms.

All sections of the manual have been updated to reflect the impact of major changes in SQL/MX Release 2.0 (for example, the introduction of SQL/MX tables).

Introductions to the data preparation steps have been revised and rewritten.

The DDL statements in Appendixes A, B, and C have been updated to use SQL/MX DDL syntax. Appendix A syntax has been removed; readers can consult the SQL/MX Reference Manual for the most current syntax and examples.

Index entries have been added, updated, and corrected.

About This Manual

This manual presents a nine-step knowledge discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Audience

This manual is intended for database administrators and application programmers who are using NonStop SQL/MX to solve data mining problems, either through the SQL conversational interface or through embedded SQL programs.

Organization

The sections listed in Table i describe the knowledge discovery process (or the data mining process) and present examples that carry out the process. The appendixes listed in Table i provide the syntax for the data mining features of NonStop SQL/MX and the SQL scripts that create the data mining database used in the examples.

Table i. Manual Organization

Section 1, Introduction
  Presents an overview of the knowledge discovery process and the SQL/MX approach to this process. Defines the example business opportunity used in this manual.

Section 2, Preparing the Data
  Describes the data preparation steps of the knowledge discovery process.

Section 3, Creating the Data Mining View
  Describes how to create the mining view.

Section 4, Mining the Data
  Describes the data mining steps of the knowledge discovery process.

Appendix A, Creating the Data Mining Database
  Contains DDL statement scripts that you can use to create the data mining database used in the examples in this manual.

Appendix B, Inserting Into the Data Mining Database
  Contains INSERT statement scripts that you can use to populate the data mining database used in this manual.

Appendix C, Importing Into the Data Mining Database
  Contains IMPORT statement scripts that you can use to create the data mining database used in this manual.

Related Documentation

This manual is part of the SQL/MX library of manuals, which includes:

Introductory Guides

SQL/MX Comparison Guide for SQL/MP Users
  Describes SQL differences between SQL/MP and SQL/MX.

SQL/MX Quick Start
  Describes basic techniques for using SQL in the SQL/MX conversational interface (MXCI). Includes information about installing the sample database.

Reference Manuals

SQL/MX Reference Manual
  Describes the syntax of SQL/MX statements, MXCI commands, functions, and other SQL/MX language elements.

SQL/MX Connectivity Service Command Reference
  Describes the SQL/MX administrative command library (MACL) available with the SQL/MX conversational interface (MXCI).

DataLoader/MX Reference Manual
  Describes the features and functions of the DataLoader/MX product, a tool to load SQL/MX databases.

SQL/MX Messages Manual
  Describes SQL/MX messages.

SQL/MX Glossary
  Defines SQL/MX terminology.

Programming Manuals

SQL/MX Programming Manual for C and COBOL
  Describes how to embed SQL/MX statements in ANSI C and COBOL programs.

SQL/MX Programming Manual for Java
  Describes how to embed SQL/MX statements in Java programs according to the SQLJ standard.

SQL/MX Guide to Stored Procedures in Java
  Describes how to use stored procedures that are written in Java within SQL/MX.

Specialized Guides

SQL/MX Installation and Management Guide
  Describes how to plan, install, create, and manage an SQL/MX database. Explains how to use installation and management commands and utilities.

SQL/MX Query Guide
  Describes how to understand query execution plans and write optimal queries for an SQL/MX database.

SQL/MX Data Mining Guide
  Describes the SQL/MX data structures and operations to carry out the knowledge-discovery process.

SQL/MX Queuing and Publish/Subscribe Services
  Describes how SQL/MX integrates transactional queuing and publish/subscribe services into its database infrastructure.

SQL/MX Report Writer Guide
  Describes how to produce formatted reports using data from a NonStop SQL/MX database.

SQL/MX Connectivity Service Manual
  Describes how to install and manage the SQL/MX Connectivity Service (MXCS), which enables applications developed for the Microsoft Open Database Connectivity (ODBC) application programming interface (API) and other connectivity APIs to use SQL/MX.

Online Help

The SQL/MX Online Help consists of:

Reference Help
  Overview and reference entries from the SQL/MX Reference Manual.

Messages Help
  Individual messages grouped by source from the SQL/MX Messages Manual.

Glossary Help
  Terms and definitions from the SQL/MX Glossary.

NSM/web Help
  Context-sensitive help topics that describe how to use the NSM/web management tool.

Related SQL/MP Manuals

The following manuals are part of the SQL/MP library of manuals and are essential references for information about SQL/MP Data Definition Language (DDL) and SQL/MP installation and management:

SQL/MP Reference Manual
  Describes the SQL/MP language elements, expressions, predicates, functions, and statements.

SQL/MP Installation and Management Guide
  Describes how to plan, install, create, and manage an SQL/MP database. Describes installation and management commands and SQL/MP catalogs and files.

This figure shows the manuals in the SQL/MX library:

[Figure VST001.vsd: the SQL/MX library of manuals, grouped into Introductory Guides, Reference Manuals, Programming Manuals, Specialized Guides, and the SQL/MX Online Help]

Notation Conventions

Hypertext Links

Blue underline is used to indicate a hypertext link within text. By clicking a passage of text with a blue underline, you are taken to the location described. For example:

This requirement is described under Backup DAM Volumes and Physical Disk Drives on page 3-2.

General Syntax Notation

This list summarizes the notation conventions for syntax presentation in this manual.

UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. For example:

  MAXATTACH

lowercase italic letters. Lowercase italic letters indicate variable items that you supply. Items not enclosed in brackets are required. For example:

  file-name

computer type. Computer type letters within text indicate C and Open System Services (OSS) keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. For example:

  myfile.c

italic computer type. Italic computer type letters within text indicate C and Open System Services (OSS) variable items that you supply. Items not enclosed in brackets are required. For example:

  pathname

[ ] Brackets. Brackets enclose optional syntax items. For example:

  TERM [\system-name.]$terminal-name
  INT[ERRUPTS]

A group of items enclosed in brackets is a list from which you can choose one item or none. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:

  FC [ num  ]
     [ -num ]
     [ text ]

  K [ X | D ] address

{ } Braces. A group of items enclosed in braces is a list from which you are required to choose one item. The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:

  LISTOPENS PROCESS { $appl-mgr-name }
                    { $process-name  }

  ALLOWSU { ON | OFF }

| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces. For example:

  INSPECT { OFF | ON | SAVEABEND }

... Ellipsis. An ellipsis immediately following a pair of brackets or braces indicates that you can repeat the enclosed sequence of syntax items any number of times. For example:

  M address [ , new-value ]...
  [ - ] {0|1|2|3|4|5|6|7|8|9}...

An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times. For example:

  "s-char..."

Punctuation. Parentheses, commas, semicolons, and other symbols not previously described must be typed as shown. For example:

  error := NEXTFILENAME ( file-name ) ;
  LISTOPENS SU $process-name.#su-name

Quotation marks around a symbol such as a bracket or brace indicate the symbol is a required character that you must type as shown. For example:

  "[" repetition-constant-list "]"

Item Spacing. Spaces shown between items are required unless one of the items is a punctuation symbol such as a parenthesis or a comma. For example:

  CALL STEPMOM ( process-id ) ;

If there is no space between two items, spaces are not permitted. In this example, no spaces are permitted between the period and any other items:

  $process-name.#su-name

Line Spacing. If the syntax of a command is too long to fit on a single line, each continuation line is indented three spaces and is separated from the preceding line by a blank line. This spacing distinguishes items in a continuation line from items in a vertical list of selections. For example:

  ALTER [ / OUT file-spec / ] LINE

     [ , attribute-spec ]

!i and !o. In procedure calls, the !i notation follows an input parameter (one that passes data to the called procedure); the !o notation follows an output parameter (one that returns data to the calling program). For example:

  CALL CHECKRESIZESEGMENT ( segment-id    !i
                          , error ) ;     !o

!i,o. In procedure calls, the !i,o notation follows an input/output parameter (one that both passes data to the called procedure and returns data to the calling program). For example:

  error := COMPRESSEDIT ( filenum ) ;     !i,o

!i:i. In procedure calls, the !i:i notation follows an input string parameter that has a corresponding parameter specifying the length of the string in bytes. For example:

  error := FILENAME_COMPARE_ ( filename1:length       !i:i
                             , filename2:length ) ;   !i:i

!o:i. In procedure calls, the !o:i notation follows an output buffer parameter that has a corresponding input parameter specifying the maximum length of the output buffer in bytes. For example:

  error := FILE_GETINFO_ ( filenum                    !i
                         , [ filename:maxlen ] ) ;    !o:i

Notation for Messages

This list summarizes the notation conventions for the presentation of displayed messages in this manual.

Bold Text. Bold text in an example indicates user input typed at the terminal. For example:

  ENTER RUN CODE
  ?123
  CODE RECEIVED:  123.00

The user must press the Return key after typing the input.

Nonitalic text. Nonitalic letters, numbers, and punctuation indicate text that is displayed or returned exactly as shown. For example:

  Backup Up.

lowercase italic letters. Lowercase italic letters indicate variable items whose values are displayed or returned. For example:

  p-register
  process-name

[ ] Brackets. Brackets enclose items that are sometimes, but not always, displayed. For example:

  Event number = number [ Subject = first-subject-value ]

A group of items enclosed in brackets is a list of all possible items that can be displayed, of which one or none might actually be displayed. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:

  proc-name trapped [ in SQL | in SQL file system ]

{ } Braces. A group of items enclosed in braces is a list of all possible items that can be displayed, of which one is actually displayed. The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:

  obj-type obj-name state changed to state, caused by
  { Object | Operator | Service }

  process-name State changed from old-objstate to objstate
  { Operator Request. }
  { Unknown.          }

| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces. For example:

  Transfer status: { OK | Failed }

% Percent Sign. A percent sign precedes a number that is not in decimal notation. The % notation precedes an octal number. The %B notation precedes a binary number. The %H notation precedes a hexadecimal number. For example:

  %005400
  %B101111
  %H2F
  P=%p-register E=%e-register

Notation for Management Programming Interfaces

This list summarizes the notation conventions used in the boxed descriptions of programmatic commands, event messages, and error lists in this manual.

UPPERCASE LETTERS. Uppercase letters indicate names from definition files. Type these names exactly as shown. For example:

  ZCOM-TKN-SUBJ-SERV

lowercase letters. Words in lowercase letters are words that are part of the notation, including Data Definition Language (DDL) keywords. For example:

  token-type

!r. The !r notation following a token or field name indicates that the token or field is required. For example:

  ZCOM-TKN-OBJNAME token-type ZSPI-TYP-STRING.    !r

!o. The !o notation following a token or field name indicates that the token or field is optional. For example:

  ZSPI-TKN-MANAGER token-type ZSPI-TYP-FNAME32.   !o


1. Introduction

Knowledge discovery is an iterative process involving many query-intensive steps. The challenges of data management in supporting this process efficiently are significant and continue to grow as knowledge discovery becomes more widely used.

Data mining identifies and characterizes interrelationships among multiple variables without requiring a data analyst to formulate specific questions. Software tools look for trends and patterns and flag unusual or potentially interesting ones. Because data mining reveals previously unknown information and patterns, rather than proving or disproving a hypothesis, mining enables knowledge discovery rather than just knowledge verification.

This section discusses these approaches to data mining:

The Traditional Approach
  Today, most data mining is performed outside the database, by client tools working on data extracts. This approach is limited because important information might be omitted from the data extract.

The SQL/MX Approach
  The SQL/MX approach to knowledge discovery enables you to perform many data-intensive tasks in the database itself, rather than on extracts. Examples include statistical sampling, statistical functions, temporal reasoning through sequence functions, cross-table generation, database profiling, and moving-window aggregations.

The Knowledge Discovery Process
  In the SQL/MX approach, fundamental data structures and operations are built into the database management system (DBMS) to support a wide range of knowledge discovery tasks and algorithms. The knowledge discovery process is described as a series of steps that starts with the selection and definition of a business opportunity, continues through data preparation and modeling, and ends with the deployment of the new knowledge.
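As an illustration of performing data-intensive work in the database rather than on an extract, the following query draws a random sample directly in the DBMS. This is a sketch only: the table and column names come from the example database used later in this manual, and the exact SAMPLE syntax should be checked against the SAMPLE clause entry in the SQL/MX Reference Manual.

```sql
-- Sketch: profile a 10 percent random sample of the Customers table
-- inside the DBMS, avoiding a client-side flat-file extract.
-- Column names and the exact SAMPLE syntax are illustrative; see the
-- SAMPLE clause entry in the SQL/MX Reference Manual.
SELECT Marital_Status,
       COUNT(*)    AS Num_Customers,
       AVG(Income) AS Avg_Income
FROM Customers
SAMPLE RANDOM 10 PERCENT
GROUP BY Marital_Status;
```

Because the sample is computed in the SQL engine, the full table never leaves the database, and the same statement scales to tables far larger than client memory.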
The Traditional Approach

Today's traditional knowledge discovery systems consist of an application program on top of a data source. The main emphasis in these systems is data mining: inventing new techniques and algorithms, proving their statistical soundness, and validating their effectiveness given a suitable problem. Data must be available in a convenient form, typically a flat file extracted from an appropriate data source. The knowledge discovery system consists of specific algorithms that load the entire data set into memory and perform the necessary computations.

The extract approach has two major limitations:

It does not scale to large data sets, because the entire data set is required to fit in memory. Statistical sampling can be used to avoid this limitation. However, sampling is inappropriate in many situations because it might cause patterns to be missed, such as those in small groups or those between records.

It cannot conveniently manage multiple versions of data across numerous iterations of a typical knowledge discovery investigation. For example, each iteration might require extracting additional data, performing incremental updates, deriving new attributes, and so on.

The SQL/MX Approach

In most enterprise organizations today, database systems are crucial for conducting business. DBMS systems serve as the transaction processing systems for daily operations and manage data warehouses containing huge amounts of historical information. The validated data in these warehouses is already being used for online analysis and is a natural starting point for knowledge discovery.

The SQL/MX approach identifies fundamental data structures and operations that are common across a wide range of knowledge discovery tasks and builds those structures and operations into the DBMS. The primary advantages of the SQL/MX technology over traditional data mining techniques include:

The ability to mine much larger data sets, not only data in flat-file extracts
Simplified data management
More complete results
Better performance and reduced cycle times

The main features of the SQL/MX approach are summarized next.

Data-Intensive Computations Performed in the DBMS

Tools and applications perform data-intensive data-preparation tasks in the DBMS by using an SQL interface. As a result, you can access the powerful and parallel DBMS data manipulation capabilities in the data preparation stage of the knowledge discovery process.

Use of Built-In DBMS Data Structures and Operations

Fundamental data structures and operations are built into the DBMS to support a wide range of knowledge discovery tasks and algorithms in an efficient and scalable manner.
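One built-in operation of this kind is moving-window aggregation through sequence functions. The following sketch computes a three-month moving average of account balances in the DBMS; the table and column names are assumptions based on the example database in this manual, and the exact sequence-function syntax should be checked against the SQL/MX Reference Manual.

```sql
-- Sketch: a three-month moving average of each account's balance,
-- computed in the DBMS with a sequence function instead of client code.
-- Table and column names follow this manual's example database; the
-- exact SEQUENCE BY and MOVINGAVG syntax is in the SQL/MX Reference
-- Manual.
SELECT Account, Month, Balance,
       MOVINGAVG(Balance, 3) AS Balance_3mo_Avg
FROM Account_History
SEQUENCE BY Account, Month;
```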

Building these data structures and operations into the DBMS allows mining tasks to be moved into the SQL engine for tighter integration of data and mining operations and for improved performance and scalability. Adding new primitives, such as moving-window aggregate functions, simplifies the queries needed by knowledge discovery tools and applications. This type of query simplification often results in significant improvements in performance.

The Knowledge Discovery Process

The knowledge discovery process is a nine-step process that starts with the selection and definition of a business opportunity, continues through several data preparation steps and a modeling step, and ends with the deployment of the new knowledge. This subsection briefly describes each step of that process.

1. Identify and define a business opportunity. The process begins with the identification and precise specification of a business opportunity. See Defining the Business Opportunity.

2. Preprocess and load the data for the business opportunity. Real-world data is often inconsistent and incomplete. The first preparation step is to address these problems by preprocessing the data in various ways, for example, verifying and mapping the data. Then load the data into your database system. See Preparing the Data.

3. Profile and understand the relevant data. Generate a variety of statistics such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on. See Profiling the Data.

4. Define events relevant to the business opportunity being explored. Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events.

5. Derive attributes. For example, customer age can be derived from birth date. Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information. See Preparing the Data.

6. Create the data mining view. Transform the data into a mining view, a form in which all attributes about the primary mining entity occur in a single record. See Creating the Mining View.

7. Mine the data and build models. Core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data. See Mining the Data.

8. Deploy models. Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or deployment might be embedding the model in an operational system to achieve predictive results.

9. Monitor model performance. Performance of the model must be monitored for accuracy. When accuracy begins to decline, the model must be updated to fit the current situation. See Knowledge Deployment and Monitoring.

In Step 1, a business opportunity is identified and defined. In Steps 2 through 6, data mining data is gathered, preprocessed, and organized in a form that is suitable for mining. These steps require the most time in the process. For example, selecting the data is an important step in the process and typically requires the assistance of a data mining expert or subject matter expert who has knowledge of the data to be mined. In Step 7, models are built. In Steps 8 and 9, the models are deployed and monitored. This latter part of the knowledge discovery process focuses on analyzing the data mining view prepared in Steps 2 through 6.

Defining the Business Opportunity

The process begins with the identification and precise specification of a business opportunity.
Several factors must be considered when evaluating potential opportunities:

Quantification of the return on investment
  What is the answer worth? How much money can be saved? How much of a competitive advantage does it offer?

Usability of the results
  Merely identifying patterns is not enough. The opportunity and analysis must be structured so that any interpretation of results obtained develops into deployable business strategy.

Political and organizational reaction
  In assessing probabilities for organizational resistance, it is helpful to examine similar past efforts and understand why those efforts succeeded or failed.

Availability of business analysts and data mining experts and technology
  Are data, domain, and mining experts available to participate in the process? Is sufficient technology, both hardware and software, available?

Data availability
  Does preclassified data exist, or can it be derived? Do sufficiently large amounts of data exist? Both internal and external data sources should be considered.

Logistics
  How difficult is it to collect, extract, and transport the relevant data? Is confidentiality an issue?

Careful consideration of these factors helps to ensure that the opportunity selected is both amenable to data mining and likely to provide significant value.

After an opportunity is selected, the next task is to specify it precisely. In the scenario of building a model to predict credit card account attrition, the goal is to build a model that will predict, as early as possible, whether a credit card customer will close their account. To specify this opportunity precisely, decide on an explicit definition of attrition, such as when a customer calls and closes their account. Another option is implicit attrition: when a customer stops using their card. For simplicity, define attrition as a customer closing their account or maintaining a zero balance for three months.

Another aspect of specifying the opportunity is defining what it means to predict as early as possible when an account will be closed. For this example, choose three months as the prediction window.
This window should be long enough to allow the card issuer to take some action to try to retain customers likely to leave, but short enough to capture attrition-related patterns. The goal is to build a model that will predict, as early as possible, customer attrition.

Example Business Opportunity

The precise specification of our example opportunity is to build a model that will predict at any point in time, based on such things as current account status, account activity, and demographics, whether a credit card customer will close their account in the future. Note that the precise specification of the opportunity might be modified or refined later in the knowledge discovery process as more information becomes available.

This manual uses this opportunity scenario to describe the knowledge discovery process and how to implement it. The data set used to illustrate techniques and SQL/MX features consists of two tables: one containing customer information and the other containing account history information. This data set is presented in Appendix A through C of this manual. A subset of this data set is shown in these tables:

Customers Table

Account   Name           Marital Status   Home   Income
...       Jones, Mary    Single           Own    65,...
...       Abbas, Ali     Divorced         Rent   32,...
...       Kano, Tomoko   Divorced         Own    44,...
...       Lund, Erika    Widow            Own    28,000

Account History Table

Account   Month   Status   Limit   Balance   Payment   Fin. Chrg
...       .../03  Open     10,...  ...       ...       ...
...       .../02  Open      5,...  ...       ...       ...
...       .../00  Open      6,...  ...       ...       ...
...       .../03  Open     10,...  ...       ...       ...
...       .../02  Open      5,...  ...       ...       ...

The first table, the Customers table, contains one row for each credit card account and consists of customer demographic information such as marital status, income, and so on. For a large financial institution, a customers table such as this one might contain approximately 10 million rows and 100 columns. The second table, the Account History table, contains monthly status records, one for each account for each month the account was open over a given time period, and consists of about 200 columns. For this example, suppose the time period is three years. The history table would then contain about 360 million rows, assuming 10 million customers. Given these parameters, the size of the first table is about 5 GB (10 million rows, 500 bytes in each row), and the size of the second table is about 360 GB (360 million rows, 1000 bytes in each row).

For the example business opportunity, the Status and Balance fields of the Account History table are used to determine whether a customer will close their account. If the Status changes from Open to Closed, or if the Balance is zero for three consecutive months, then the customer is defined as having left, that is, no longer holds a credit card account.

Preparing the Data

After a business opportunity has been identified and defined, the next task is to prepare a data set for mining. This is done in Steps 2 through 6 of the knowledge discovery process. See The Knowledge Discovery Process on page 1-3. The first two steps are preprocessing the mining data to make it consistent and then loading the data into a database system. For further information, see Loading the Data on page 2-2.

The next step is to generate a variety of statistics, for example, column unique entry counts, value ranges, number of missing values, mean, variance, and so on. This type of data profile is helpful in gaining an understanding of the data, and the profile also serves as a valuable reference throughout the knowledge discovery process.

Profiling the Data

A profile of the database helps to solve the data mining problem in these ways:

To better understand the data
To decide which columns to use for analysis
To decide whether to treat attributes as discrete or continuous

Types of Information

The type of information used to create a profile of the data mining view comes from the following elements:

Tables in the database
Table attributes (or columns to be used in the analysis)
Data types of the table attributes
Relationships between tables
Cardinalities of discrete attributes
Statistics about continuous attributes
Derived table attributes (or derived columns to be used in the analysis)

Determining the derived columns to be constructed requires knowledge of the table attributes and how these attributes relate to the data mining problem. See Preparing the Data on page 1-7 for a full discussion of these elements.

SQL/MX provides the TRANSPOSE clause of the SELECT statement to display the cardinalities of discrete attributes.
See Transposition on page 2-3 and the TRANSPOSE Clause entry in the SQL/MX Reference Manual for details.

Example of Finding Cardinality of Discrete Attributes

The customers table in your data set has Age and Number_Children columns. Both of these attributes are discrete, and you can compute the cardinality of each attribute.

You obtain the cardinality of an attribute, which is the count of the number of unique values for the attribute, by using a COUNT DISTINCT query. For example:

SELECT COUNT(DISTINCT Age) FROM Customers;

or

SELECT COUNT(DISTINCT Number_Children) FROM Customers;

Instead of having to submit a query for each attribute, you can obtain counts for multiple attributes of a table by using the TRANSPOSE clause. For example:

SET NAMETYPE ANSI;
SET SCHEMA dmcat.whse;

SELECT ColumnIndex, COUNT(DISTINCT ColumnValue)
FROM Customers
  TRANSPOSE Age, Number_Children AS ColumnValue
  KEY BY ColumnIndex
GROUP BY ColumnIndex;

COLUMNINDEX  (EXPR)
-----------  ------
          1     ...
          2     ...

--- 2 row(s) selected.

The first row of the result table of the TRANSPOSE clause contains the distinct count for the column Age, and the second row contains the distinct count for the column Number_Children. You can treat the Age values as categories, consisting of age ranges. Similarly, values of Number_Children greater than five can be placed in the category for Number_Children equal to five. The number of attributes in a TRANSPOSE clause is unlimited.

Note. The data types of attributes to be transformed into a single column must be compatible. The data type of the result column is the union-compatible data type of the attributes. For further information, see Profiling the Data on page 2-2.

Defining Events

In the scenario considered in this manual, the relevant event is the account holder leaving. This event occurs at different points in time for customers that leave and not at all for customers that stay. This event must be defined so that account status and activity in the months leading up to a customer leaving can be located and aligned in columns. For example, suppose you create three derived attributes that describe the account balance for each of the

three months before a customer leaves, because these attributes are predictors of attrition. For the customers that do leave, the months leading up to leaving occur at various points in time. For customers that do not leave, these months are chosen to be any three consecutive months in which the account is open. The information about these months should be aligned for all accounts in a single set of columns, one for each of the three months. Most mining algorithms require a single logical attribute, such as the balance one month before leaving, to be stored in one column in all records, rather than in different columns in different records.

For example, consider this data in a table that contains monthly account balances for each month in the three-year history period:

  Account  ...  Bal 08/03  Bal 09/03  Bal 10/03  Bal 11/03  ...  Left
  ...      ...  ...        ...        ...        ...        ...  Yes (closed)
  ...      ...  ...        ...        ...        ...        ...  Yes (0 bal)

  Account  ...  Bal 07/02  Bal 08/02  Bal 09/02  Bal 10/02  ...  Left
  ...      ...  ...        ...        ...        ...        ...  Yes (closed)

The balances prior to the event (of the customer leaving) are in different date columns for these accounts, and therefore algorithms that build predictive models are not able to consider this information. A table organization that allows this information to be considered:

  Account  ...  Bal-3  Bal-2  Bal-1  Date Left  ...  Left
  ...      ...  ...    ...    ...    .../03     ...  Yes (closed)
  ...      ...  ...    ...    ...    .../03     ...  Yes (0 bal)
  ...      ...  ...    ...    ...    .../03     ...  Yes (closed)

In this table, columns Bal-1 through Bal-3 contain account balances one through three months prior to a customer leaving. Consequently, this information is aligned within a single set of columns and can be considered during model creation. For further information, see Defining Events on page 2-6.

Deriving Attributes

The next task is to derive attributes that are not relative to events. For example, customer age can be derived from birth date. Part of the challenge of effective data mining is identifying a set of derived attributes that capture key indicators relevant to the business opportunity being explored.
For further information, see Deriving Attributes on page 2-9.
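The age derivation mentioned above is simple enough to sketch outside SQL. A minimal Python illustration, with hypothetical birth dates and a fixed reference date for reproducibility (none of these values come from the manual's data set):

```python
from datetime import date

def derive_age(birth_date, as_of):
    """Age in whole years on the as_of date, as a derived attribute."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

# Hypothetical customer birth dates, evaluated as of a fixed date.
as_of = date(2004, 4, 1)
print(derive_age(date(1970, 3, 15), as_of))  # 34: birthday already passed
print(derive_age(date(1970, 6, 15), as_of))  # 33: birthday still ahead
```

The same subtraction-with-birthday-correction logic can be expressed as a derived column in SQL with date functions.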

Creating the Mining View

The final data preparation step is to transform the data set into a mining view, a form in which all attributes about the main mining entity appear in a single record. The mining entity used in this manual is a credit card account. The data mining challenge is to determine predictors for when a customer will close a credit card account.

Transforming the data set to a single record for each mining entity often involves a pivot operation, in which attributes in multiple rows are collapsed and put into a single row. For example, in the credit card example, the set of history records associated with each account is collapsed to a single record and then appended to the corresponding customer record. For further information, see Section 3, Creating the Data Mining View. The resulting table looks similar to this:

Mining View

  Account  Mar Status  Income  Bal-3  Bal-2  Bal-1  Date Left  Left
  ...      Single      65,000  ...    ...    ...    .../99     Yes
  ...      Divorced    32,000  ...    ...    ...    .../99     Yes
  ...      Divorced    44,000  ...    ...    ...    .../98     Yes
  ...      Married     32,000  ...    ...    ...    ...        No

This table contains demographic information from the Customers table, such as marital status and income, and also pivoted columns from the Account History table, such as balances prior to leaving. You use this example data set in the data mining step, the next step in the knowledge discovery process.

Mining the Data

In the data mining step, core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks performed in this step are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data. The task used in this customer scenario is predictive: to build a model to predict attrition of credit card customers based on historical information, such as demographics and account activity.
The most common predictive tasks are:

  Classification: Classify a case (or record) into one of several predefined classes.
  Regression: Map a case (or record) into a numerical prediction value.

Descriptive tasks involve finding patterns describing the data. The most common are:

  Database segmentation (clustering): Map a case into one of several clusters.
  Summarization: Provide a compact description of the data, often in visual form.
  Link analysis: Determine relationships between attributes in a case.
  Sequence analysis: Determine trends over time.

You use a variety of algorithms, and the models they produce, to perform these predictive and descriptive tasks. For example, classification can be done by building a decision tree model, where each branch of the tree is represented by a predicate involving attributes in the mining data set and where each branch is homogeneous with respect to whether the predicate is true or false. The main task in classification is to determine which predicates form the decision tree that predicts the goal. The most common algorithms for classification come from the field of machine learning in computer science.

Typically, the model building step involves the use of client mining tools that require the interactive participation of the user to guide the investigation. A description of these special-purpose tools is beyond the scope of this manual. For further information, see Section 4, Mining the Data.

Knowledge Deployment and Monitoring

The last two steps of the knowledge discovery process involve deploying and monitoring discovered knowledge. Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or it might mean embedding the model in an operational system to achieve predictive results. Most data mining tools support model deployment either by applying a model to data within the tool or by exporting a model as executable code, which can then be embedded and used in applications.
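When a model is exported as executable code, the deployed artifact is often just a scoring function applied in batch. A purely illustrative Python sketch; the tree, its thresholds, and the attribute names are invented for illustration and are not the decision tree built in this manual's example:

```python
# Hypothetical exported decision tree: predicts whether an account
# holder is likely to leave, from two mining-view attributes.
def likely_to_leave(bal_1, marital_status):
    """Toy scoring function standing in for an exported model."""
    if bal_1 == 0:
        return True   # zero balance last month: high attrition risk
    if marital_status == "Divorced" and bal_1 < 100:
        return True
    return False

# Periodic batch scoring over the mining view, as deployment might do.
accounts = [("A1", 0, "Single"), ("A2", 650, "Married"), ("A3", 50, "Divorced")]
at_risk = [acct for acct, bal_1, ms in accounts if likely_to_leave(bal_1, ms)]
print(at_risk)  # ['A1', 'A3']
```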
In the credit card attrition example, one form of model deployment is to periodically use the model to identify profitable customers that are likely to leave, and then to take some action, such as lowering interest rates or waiving fees, to try to retain these customers.


2 Preparing the Data

Section 1, Introduction, identifies and defines a business opportunity, the first step in the knowledge discovery process supported by SQL/MX. This section describes Steps 2 through 6:

1. Identify and define a business opportunity.

2. Preprocess and load the data for the business opportunity. The first preparation step is to address these problems by preprocessing the data in various ways, for example, verifying and mapping the data. Then load the data into your database system. See Loading the Data on page 2-2.

3. Profile and understand the relevant data. Generate a variety of statistics, such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on. See Profiling the Data on page 2-2.

4. Define events relevant to the business opportunity being explored. Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events on page 2-6.

5. Derive attributes. For example, customer age can be derived from birth date. Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information. See Deriving Attributes on page 2-9.

6. Create the data mining view.

7. Mine the data and build models.

8. Deploy models.

9. Monitor model performance.
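The pivot performed in Step 6 is commonly expressed as a grouped conditional aggregation. A sketch in standard SQL via Python's sqlite3, with a hypothetical history table whose rows are already keyed by months before the leave event (SQL/MX's own pivot mechanism is described in Section 3; this is only an illustration of the idea):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE acct_history (account TEXT, months_before_event INTEGER,
                               balance INTEGER);
    INSERT INTO acct_history VALUES
        ('A1', 3, 500), ('A1', 2, 420), ('A1', 1, 300);
""")

# One output column per relative month: the CASE keeps only the matching
# row's balance, and MAX collapses the group to a single row per account.
row = con.execute("""
    SELECT account,
           MAX(CASE WHEN months_before_event = 3 THEN balance END) AS bal_3,
           MAX(CASE WHEN months_before_event = 2 THEN balance END) AS bal_2,
           MAX(CASE WHEN months_before_event = 1 THEN balance END) AS bal_1
    FROM acct_history
    GROUP BY account
""").fetchone()
print(row)  # ('A1', 500, 420, 300)
```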

Loading the Data

The first step in preparing a data set for mining is loading the data into database tables. Suppose the credit card organization has a customers data warehouse. The customer data and the account history data are stored in this warehouse. In a typical real-world scenario, the warehouse could have millions of records representing millions of customers dating back many years.

Creating the Database

Suppose a data mining database is created consisting of the Customers table and the Account History table described in the previous section. You can use the DDL scripts included with this manual to create a database to run the examples in this manual. To create the database:

1. Open the .pdf file for this manual.
2. Navigate to Appendix A, Creating the Data Mining Database, which contains the DDL script that creates the database.
3. On the tool bar, select the Table/Formatted Text Select Tool.
4. Copy and paste from the DDL script, one page at a time, into an OSS text file.
5. Within MXCI (the SQL/MX conversational interface), obey the OSS file you have created.

Importing Data Into the Database

After the data mining database is created, the warehouse data is imported into the database. In a typical real-world scenario, you would import the data by using some type of database utility; for example, you can use the DataLoader/MP utility to import a large quantity of data into an SQL/MP database. For further information, see the DataLoader/MX Reference Manual and the SQL/MX Reference Manual for discussions of the Import Utility.

Alternatively, you can use INSERT statements to insert values into the data mining database. The INSERT statements for the example in this manual are included in Appendix B, Inserting Into the Data Mining Database.

Profiling the Data

Profiling often begins with the computation of basic information about each attribute.
For discrete attributes, this basic information is typically a table of the unique values and a count of how many times each value occurs. However, as cardinality increases, these frequencies become less and less meaningful. For continuous attributes, the approach is to use metrics such as minimum, maximum, mean, and variance.
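The continuous-attribute metrics named here can be sanity-checked off-line with a small script. A sketch over a hypothetical sample of balances using Python's statistics module (note that a SQL engine's VARIANCE function may use the sample formula, dividing by n-1, rather than the population formula shown here):

```python
import statistics

# Hypothetical balance sample; not from the manual's data set.
balances = [120, 0, 450, 300, 80]

profile = {
    "min": min(balances),
    "max": max(balances),
    "mean": statistics.mean(balances),
    # Population variance; some engines' VARIANCE divides by n-1 instead.
    "variance": statistics.pvariance(balances),
}
print(profile)
```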

Cardinalities and Metrics

For any attribute, one approach to profiling is to run a separate query for each attribute. As an example, consider the following queries, which profile the discrete attribute Marital Status from the Customers table and the continuous attribute Balance from the Account History table.

Example of Discrete Attribute

This query counts the occurrences of each distinct value of the Marital Status column of the Customers table:

SELECT marital_status, COUNT(*)
FROM customers
GROUP BY marital_status;

Example of Continuous Attribute

This query computes statistical information about the continuous attribute Balance in the Account History table:

SELECT MIN(balance), MAX(balance), AVG(balance), VARIANCE(balance)
FROM acct_history;

Transposition

Other than the computation of a few metrics, both of the previous queries require a complete scan of the data. In this way, a table with N attributes requires N queries, resulting in the same number of complete scans. For a wide mining table, this procedure can result in thousands of queries and scans of the data.

Using transposition, SQL/MX can perform the above profiling operations with a total of only two queries, regardless of the number of attributes to be profiled. Through the TRANSPOSE clause of the SELECT statement, different columns of a source table can be treated as a single output column, enabling similar computations to be performed on all such source columns. TRANSPOSE takes each row in the source table and converts each expression listed in the transpose set to an individual output row. Used in this way, TRANSPOSE can compute frequency counts for all discrete attributes in a table in a single query. See the TRANSPOSE Clause entry in the SQL/MX Reference Manual for more information.
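Outside SQL/MX, a TRANSPOSE-style single-query frequency profile can be approximated with a UNION ALL that stacks the attributes into one column. A sketch using Python's sqlite3 with a hypothetical two-attribute customers table (the real TRANSPOSE clause avoids the multiple scans this emulation implies):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (gender TEXT, home TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [("F", "Own"), ("M", "Own"), ("F", "Rent"), ("M", "Own")])

# Tag each source column with its name, stack the values into one column,
# then a single GROUP BY yields frequency counts for every attribute at once.
rows = con.execute("""
    SELECT attr, c1, COUNT(*)
    FROM (SELECT 'GENDER' AS attr, gender AS c1 FROM customers
          UNION ALL
          SELECT 'HOME'   AS attr, home   AS c1 FROM customers)
    GROUP BY attr, c1
    ORDER BY attr, c1
""").fetchall()
print(rows)  # [('GENDER', 'F', 2), ('GENDER', 'M', 2), ('HOME', 'Own', 3), ('HOME', 'Rent', 1)]
```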
Example of Computing Counts for Character Discrete Attributes

This query computes the frequency counts for the discrete attributes Gender, Marital Status, and Home, which are all type character:

SET NAMETYPE ANSI;
SET SCHEMA mining.whse;

SELECT attr, c1, COUNT(*)
FROM customers
  TRANSPOSE ('GENDER', gender),
            ('HOME', home),
            ('MARITAL_STATUS', marital_status)
    AS (attr, c1)
GROUP BY attr, c1
ORDER BY attr, c1;

ATTR            C1        (EXPR)
--------------  --------  ------
GENDER          F             20
GENDER          M             22
HOME            Own           33
HOME            Rent           9
MARITAL_STATUS  Divorced      12
MARITAL_STATUS  Married        9
MARITAL_STATUS  Single        15
MARITAL_STATUS  Widow          6

--- 8 row(s) selected.

Because this query produces counts for three different attributes, use the ATTR column to distinguish the attribute from which the values are drawn. The C1 column contains the values for these character attributes.

Example of Computing Counts for Character and Numeric Discrete Attributes

This query also shows the TRANSPOSE clause and illustrates how profiling can be achieved. The column C2 has been added to the statement because Number_Children has a numeric data type.

SELECT attr, c1, c2, COUNT(*)
FROM customers
  TRANSPOSE ('GENDER', gender, null),
            ('HOME', home, null),
            ('MARITAL_STATUS', marital_status, null),
            ('NUMBER_CHILDREN', null, number_children)
    AS (attr, c1, c2)
GROUP BY attr, c1, c2
ORDER BY attr, c1, c2;

ATTR             C1        C2  (EXPR)
---------------  --------  --  ------
GENDER           F          ?      20
GENDER           M          ?      22
HOME             Own        ?      33
HOME             Rent       ?       9
MARITAL_STATUS   Divorced   ?      12
MARITAL_STATUS   Married    ?       9
MARITAL_STATUS   Single     ?      15
MARITAL_STATUS   Widow      ?       6
NUMBER_CHILDREN  ?          0      25
NUMBER_CHILDREN  ?


STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

HP NonStop SQL DDL Replicator User s Guide

HP NonStop SQL DDL Replicator User s Guide HP NonStop SQL DDL Replicator User s Guide Abstract HP NonStop SQL DDL Replicator Software replicates NonStop SQL DDL operations to one or more backup systems. Product Version NonStop SQL DDL Replicator

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

MOC 20461C: Querying Microsoft SQL Server. Course Overview

MOC 20461C: Querying Microsoft SQL Server. Course Overview MOC 20461C: Querying Microsoft SQL Server Course Overview This course provides students with the knowledge and skills to query Microsoft SQL Server. Students will learn about T-SQL querying, SQL Server

More information

Oracle Database 10g: Introduction to SQL

Oracle Database 10g: Introduction to SQL Oracle University Contact Us: 1.800.529.0165 Oracle Database 10g: Introduction to SQL Duration: 5 Days What you will learn This course offers students an introduction to Oracle Database 10g database technology.

More information

Access Queries (Office 2003)

Access Queries (Office 2003) Access Queries (Office 2003) Technical Support Services Office of Information Technology, West Virginia University OIT Help Desk 293-4444 x 1 oit.wvu.edu/support/training/classmat/db/ Instructor: Kathy

More information

Kiwi Log Viewer. A Freeware Log Viewer for Windows. by SolarWinds, Inc.

Kiwi Log Viewer. A Freeware Log Viewer for Windows. by SolarWinds, Inc. Kiwi Log Viewer A Freeware Log Viewer for Windows by SolarWinds, Inc. Kiwi Log Viewer displays text based log files in a tabular format. Only a small section of the file is read from disk at a time which

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: +966 12 739 894 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training is designed to

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Chapter 10 Practical Database Design Methodology and Use of UML Diagrams

Chapter 10 Practical Database Design Methodology and Use of UML Diagrams Chapter 10 Practical Database Design Methodology and Use of UML Diagrams Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 10 Outline The Role of Information Systems in

More information

Master of Science in Healthcare Informatics and Analytics Program Overview

Master of Science in Healthcare Informatics and Analytics Program Overview Master of Science in Healthcare Informatics and Analytics Program Overview The program is a 60 credit, 100 week course of study that is designed to graduate students who: Understand and can apply the appropriate

More information

Section 1.4 Place Value Systems of Numeration in Other Bases

Section 1.4 Place Value Systems of Numeration in Other Bases Section.4 Place Value Systems of Numeration in Other Bases Other Bases The Hindu-Arabic system that is used in most of the world today is a positional value system with a base of ten. The simplest reason

More information

ETPL Extract, Transform, Predict and Load

ETPL Extract, Transform, Predict and Load ETPL Extract, Transform, Predict and Load An Oracle White Paper March 2006 ETPL Extract, Transform, Predict and Load. Executive summary... 2 Why Extract, transform, predict and load?... 4 Basic requirements

More information

Expert Oracle Exadata

Expert Oracle Exadata Expert Oracle Exadata Kerry Osborne Randy Johnson Tanel Poder Apress Contents J m About the Authors About the Technical Reviewer a Acknowledgments Introduction xvi xvii xviii xix Chapter 1: What Is Exadata?

More information

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Rajesh Reddy Muley 1, Sravani Achanta 2, Prof.S.V.Achutha Rao 3 1 pursuing M.Tech(CSE), Vikas College of Engineering and

More information

Base Conversion written by Cathy Saxton

Base Conversion written by Cathy Saxton Base Conversion written by Cathy Saxton 1. Base 10 In base 10, the digits, from right to left, specify the 1 s, 10 s, 100 s, 1000 s, etc. These are powers of 10 (10 x ): 10 0 = 1, 10 1 = 10, 10 2 = 100,

More information

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals 1 Properties of a Database 1 The Database Management System (DBMS) 2 Layers of Data Abstraction 3 Physical Data Independence 5 Logical

More information

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance 3.1 Introduction This research has been conducted at back office of a medical billing company situated in a custom

More information

Business Enterprise Server Help Desk Integration Guide. Version 3.5

Business Enterprise Server Help Desk Integration Guide. Version 3.5 Business Enterprise Server Help Desk Integration Guide Version 3.5 June 30, 2010 Copyright Copyright 2003 2010 Interlink Software Services, Ltd., as an unpublished work. All rights reserved. Interlink

More information

Expedite for Windows Software Development Kit Programming Guide

Expedite for Windows Software Development Kit Programming Guide GXS EDI Services Expedite for Windows Software Development Kit Programming Guide Version 6 Release 2 GC34-3285-02 Fifth Edition (November 2005) This edition replaces the Version 6.1 edition. Copyright

More information

EMC SourceOne Auditing and Reporting Version 7.0

EMC SourceOne Auditing and Reporting Version 7.0 EMC SourceOne Auditing and Reporting Version 7.0 Installation and Administration Guide 300-015-186 REV 01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Copyright

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Heterogeneous Replication Guide. Replication Server 15.5

Heterogeneous Replication Guide. Replication Server 15.5 Heterogeneous Replication Guide Replication Server 15.5 DOCUMENT ID: DC36924-01-1550-01 LAST REVISED: March 2010 Copyright 2010 by Sybase, Inc. All rights reserved. This publication pertains to Sybase

More information

Guidelines for using Microsoft System Center Virtual Machine Manager with HP StorageWorks Storage Mirroring

Guidelines for using Microsoft System Center Virtual Machine Manager with HP StorageWorks Storage Mirroring HP StorageWorks Guidelines for using Microsoft System Center Virtual Machine Manager with HP StorageWorks Storage Mirroring Application Note doc-number Part number: T2558-96337 First edition: June 2009

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: + 38516306373 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training delivers the

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

3. Add and delete a cover page...7 Add a cover page... 7 Delete a cover page... 7

3. Add and delete a cover page...7 Add a cover page... 7 Delete a cover page... 7 Microsoft Word: Advanced Features for Publication, Collaboration, and Instruction For your MAC (Word 2011) Presented by: Karen Gray (kagray@vt.edu) Word Help: http://mac2.microsoft.com/help/office/14/en-

More information

12 File and Database Concepts 13 File and Database Concepts A many-to-many relationship means that one record in a particular record type can be relat

12 File and Database Concepts 13 File and Database Concepts A many-to-many relationship means that one record in a particular record type can be relat 1 Databases 2 File and Database Concepts A database is a collection of information Databases are typically stored as computer files A structured file is similar to a card file or Rolodex because it uses

More information

NonStop NET/MASTER D30. MS Operator's Guide. System Software Library

NonStop NET/MASTER D30. MS Operator's Guide. System Software Library System Software Library NonStop NET/MASTER MS Operator's Guide Abstract Part Number 106379 Edition This manual describes the user interface to NonStop NET/MASTER Management Services (MS). It provides the

More information

Database Programming with PL/SQL: Learning Objectives

Database Programming with PL/SQL: Learning Objectives Database Programming with PL/SQL: Learning Objectives This course covers PL/SQL, a procedural language extension to SQL. Through an innovative project-based approach, students learn procedural logic constructs

More information

Sage Abra SQL HRMS Reports. User Guide

Sage Abra SQL HRMS Reports. User Guide Sage Abra SQL HRMS Reports User Guide 2010 Sage Software, Inc. All rights reserved. Sage, the Sage logos, and the Sage product and service names mentioned herein are registered trademarks or trademarks

More information

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access

More information

ETL Process in Data Warehouse. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

ETL Process in Data Warehouse. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT ETL Process in Data Warehouse G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT Outline ETL Extraction Transformation Loading ETL Overview Extraction Transformation Loading ETL To get data out of

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: 001-855-844-3881 & 001-800-514-06-97 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals

More information

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database

More information

itp Secure WebServer System Administrator s Guide

itp Secure WebServer System Administrator s Guide itp Secure WebServer System Administrator s Guide Abstract This guide describes how to install, configure, and manage the itp Secure WebServer. It also discusses how to develop and integrate Common Gateway

More information

Chapter 2: Elements of Java

Chapter 2: Elements of Java Chapter 2: Elements of Java Basic components of a Java program Primitive data types Arithmetic expressions Type casting. The String type (introduction) Basic I/O statements Importing packages. 1 Introduction

More information

EMC NetWorker. Licensing Guide. Release 8.0 P/N 300-013-596 REV A01

EMC NetWorker. Licensing Guide. Release 8.0 P/N 300-013-596 REV A01 EMC NetWorker Release 8.0 Licensing Guide P/N 300-013-596 REV A01 Copyright (2011-2012) EMC Corporation. All rights reserved. Published in the USA. Published June, 2012 EMC believes the information in

More information

Participant Guide RP301: Ad Hoc Business Intelligence Reporting

Participant Guide RP301: Ad Hoc Business Intelligence Reporting RP301: Ad Hoc Business Intelligence Reporting State of Kansas As of April 28, 2010 Final TABLE OF CONTENTS Course Overview... 4 Course Objectives... 4 Agenda... 4 Lesson 1: Reviewing the Data Warehouse...

More information

Introduction to Querying & Reporting with SQL Server

Introduction to Querying & Reporting with SQL Server 1800 ULEARN (853 276) www.ddls.com.au Introduction to Querying & Reporting with SQL Server Length 5 days Price $4169.00 (inc GST) Overview This five-day instructor led course provides students with the

More information

FileMaker 12. ODBC and JDBC Guide

FileMaker 12. ODBC and JDBC Guide FileMaker 12 ODBC and JDBC Guide 2004 2012 FileMaker, Inc. All Rights Reserved. FileMaker, Inc. 5201 Patrick Henry Drive Santa Clara, California 95054 FileMaker and Bento are trademarks of FileMaker, Inc.

More information

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner 24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India

More information