Data Mining Standards


Arati Kadav, Jaya Kawale, Pabitra Mitra

Abstract

In this survey paper we consolidate the current data mining standards. We categorize them into process standards, XML standards, standard APIs, web standards and grid standards, and discuss each category in considerable detail. We also design an application using these standards. We then analyze the standards and their influence on data mining application development, and point out areas of data mining application development that still need to be standardized. Finally, we discuss the trend in the focus areas addressed by these standards.

Contents

Introduction
Data Mining Standards
  Process Standards
    CRISP-DM
  XML Standards
    PMML
    CWM-DM
  Web Standards
    XMLA
    Semantic Web
    Data Space
  Application Programming Interfaces (APIs)
    SQL/MM DM
    Java APIs
    Microsoft OLEDB-DM
  Grid Services
    OGSA and data mining
Developing Data Mining Application Using Data Mining Standards
  Application Requirement Specification
  Design and Deployment
Analysis
Conclusion
Appendix
  [A1] PMML example
  [A2] XMLA example
  [A3] OLEDB
  [A4] OLEDB-DM example
  [A5] SQL/MM example
  [A6] Java Data Mining Model example

1 Introduction

Researchers in data mining and knowledge discovery are creating new, more automated methods for discovering knowledge to meet the needs of the 21st century. This need for analysis will keep growing, driven by the business trends of one-to-one marketing, customer-relationship management, enterprise resource planning, risk management, intrusion detection and Web personalization, all of which require customer-information analysis and customer-preferences prediction. [GrePia]

Deploying a data mining solution requires collecting the data to be mined, then cleaning and transforming its attributes to provide the inputs for data mining models. These models also need to be built, used and integrated with different applications. Moreover, currently deployed data management software must be able to interact with the data mining models using standard APIs. Scalability calls for collecting the data to be mined from distributed and remote locations. Employing common data mining standards greatly simplifies the integration, updating, and maintenance of the applications and systems containing the models. [stdhb]

Over the past several years, various data mining standards have matured and today are used by many data mining vendors, as well as by others building data mining applications. With the maturity of data mining standards, a variety of standards-based data mining services and platforms can now be developed and deployed much more easily. Related fields such as data grids, web services, and the semantic web have also developed standards-based infrastructures and services relevant to KDD. These new standards and standards-based services and platforms have the potential to change the way data mining is used. [kdd03]

The data mining standards are concerned with one or more of the following issues [stdhb]:

1. The overall process by which data mining models are produced, used, and deployed: This includes, for example, a description of the business interpretation of the output of a classification tree.
2. A standard representation for data mining and statistical models: This includes, for example, the parameters defining a classification tree.
3. A standard representation for cleaning, transforming, and aggregating attributes to provide the inputs for data mining models: This includes, for example, the parameters defining how zip codes are mapped to three-digit codes prior to their use as a categorical variable in a classification tree.
4. A standard representation for specifying the settings required to build models and to use the outputs of models in other systems: This includes, for example, specifying the name of the training set used to build a classification tree.
5. Interfaces and Application Programming Interfaces (APIs) to other languages and systems: There are standard data mining APIs for Java and SQL. This includes, for example, a description of the API so that a classification tree can be built on data in a SQL database.
6. Standards for viewing, analyzing, and mining remote and distributed data: This includes, for example, standards for the format of the data and metadata so that a classification tree can be built on distributed web-based data.

The current established standards address these different aspects or dimensions of data mining application development. They are summarized in Table 1.

Process Standards
- Cross Industry Standard Process for Data Mining (CRISP-DM): Captures the data mining process. It begins with the business problem and ends with the deployment of the knowledge gained in the process.

XML Standards
- Predictive Model Markup Language (PMML): Model for representing data mining and statistical data.
- Common Warehouse Metamodel for Data Mining (CWM-DM): Model for metadata that specifies metadata for build settings, model representations, and results from model operations. Models are defined through the Unified Modeling Language.

Standard APIs
- SQL/MM, Java API (JSR-73), Microsoft OLE DB: APIs for data mining applications.

Protocol for transport of remote and distributed data
- DataSpace Transfer Protocol (DSTP): Used for distribution, enquiry and retrieval of data in a data space.

Model Scoring Standard
- Predictive Scoring and Update Protocol (PSUP): Can be used for online real-time scoring and updates as well as for scoring in an offline batch environment. (Scoring is the process of using statistical models to make decisions.)

Web Standards
- XML for Analysis (XMLA): Standard web service interface designed specifically for online analytical processing and data mining functions (uses the Simple Object Access Protocol, SOAP).
- Semantic Web: Provides a framework to represent information in machine-processable form and can be used to extract knowledge from data mining systems.
- Data Space: Provides an infrastructure for creating a web of data. It is built around standards like XML, DSTP and PSUP, and helps handle large data sets held at remote and distributed locations.

Grid Standards
- Open Grid Service Architecture (OGSA): Developed by Globus, this standard describes a service-based open architecture for distributed virtual organizations. It will provide the data mining engine with secure, reliable and scalable high-bandwidth access to the various distributed data sources and formats across administrative domains.

Table 1: Summary of Data Mining Standards
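Issue 3 in the list above mentions mapping zip codes to three-digit codes before they are used as a categorical variable. The following is a minimal sketch of such a transformation, assuming that the three-digit code is simply the leading three digits of a US ZIP code; the function name `zip_to_three_digit` and the sample values are hypothetical and are not taken from any of the standards.

```python
def zip_to_three_digit(zip_code: str) -> str:
    """Return the leading three digits of a ZIP code as a categorical value.

    Assumption: the three-digit code is the ZIP prefix; a standard such as
    PMML would record the parameters of a mapping like this one.
    """
    cleaned = zip_code.strip()
    if len(cleaned) < 3 or not cleaned[:3].isdigit():
        raise ValueError(f"not a usable ZIP code: {zip_code!r}")
    return cleaned[:3]


# Hypothetical input attribute values.
zip_codes = ["60607", "60614", "10027"]
three_digit_codes = [zip_to_three_digit(z) for z in zip_codes]
print(three_digit_codes)  # prints ['606', '606', '100']
```

The point of a standard representation is that this mapping travels with the model, so that the system scoring new data applies the same transformation as the system that built the model.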

Section 2 describes the above standards in detail. In Section 3 we design and develop a data mining application using these standards. Section 4 analyzes the standards and their relationship with each other, and proposes areas where standards are needed.

2. Data Mining Standards

2.1 Process Standards

CRISP-DM

CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is an industry-, tool- and application-neutral standard for defining and validating the data mining process. It was conceived in late 1996 by DaimlerChrysler, SPSS and NCR. The latest version is CRISP-DM 1.0.

Motivation:
As market interest in data mining led to its widespread uptake, every new adopter had to come up with its own approach to incorporating data mining into its existing setup. There was also a need to demonstrate that data mining was sufficiently mature to be adopted as a key part of any customer's business process. CRISP-DM provides a standard process model for conceiving, developing and deploying a data mining project; the model is non-proprietary and freely distributed.

Standard Description:
CRISP-DM organizes the process model hierarchically. At the top level the process is divided into phases. Each phase consists of several second-level generic tasks. These tasks are intended to be complete (covering the whole phase and all possible data mining applications) and stable (valid for yet unforeseen developments). The generic tasks are mapped to specialized tasks. Finally, the specialized tasks contain process instances, which are records of the actions, decisions and results of an actual data mining engagement. This is depicted in Figure 1.

The mapping of generic tasks (e.g. the task of cleaning data) to specialized tasks (e.g. cleaning numerical or categorical values) depends on the data mining context. CRISP-DM distinguishes between four dimensions of data mining context:

- Application domain (the area of the project, e.g. response modeling)
- Data mining problem type (e.g. a clustering or segmentation problem)
- Technical aspect (issues like outliers or missing values)
- Tool and technique (e.g. Clementine or decision trees)
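The mapping from generic to specialized tasks can be pictured as a lookup driven by the context. The sketch below is only an illustration under stated assumptions: CRISP-DM defines no programming interface, so the class `MiningContext`, the table `SPECIALIZED_TASKS`, the function `specialize` and the task names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class MiningContext:
    """The four CRISP-DM context dimensions; None means the dimension is not fixed."""
    application_domain: Optional[str] = None
    problem_type: Optional[str] = None
    technical_aspect: Optional[str] = None
    tool_and_technique: Optional[str] = None


# Hypothetical mapping table: (generic task, technical aspect) -> specialized task.
SPECIALIZED_TASKS: Dict[Tuple[str, str], str] = {
    ("clean data", "missing values"): "impute missing values",
    ("clean data", "outliers"): "remove outliers",
}


def specialize(generic_task: str, context: MiningContext) -> str:
    """Map a generic task to a specialized one; stay generic if no mapping applies."""
    key = (generic_task, context.technical_aspect or "")
    return SPECIALIZED_TASKS.get(key, generic_task)


context = MiningContext(
    application_domain="response modeling",
    problem_type="clustering",
    technical_aspect="missing values",
    tool_and_technique="decision trees",
)
print(specialize("clean data", context))   # prints impute missing values
print(specialize("select data", context))  # prints select data
```

Here only the technical aspect drives the lookup; a fuller sketch could key the table on any combination of the four dimensions.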

The more values are fixed for these context dimensions, the more concrete the data mining context becomes. The mappings can be done for the single data mining project at hand or for future projects. The process reference model consists of the phases shown in Figure 1 and summarized in Table 2. The sequence of the phases is not rigid: the outcome of each phase determines which phase, or which particular task of a phase, is performed next. [CRSP]

[Figure 1: CRISP-DM process model. Four-level breakdown of the CRISP-DM methodology: phases (business understanding, data understanding, data preparation, modeling, evaluation, deployment), generic tasks for each phase, specialized tasks obtained by mapping, and process instances.]

Interoperability with other standards:
CRISP-DM provides a reference model which is completely neutral to other tools, vendors, applications or existing standards.

Business understanding: Focuses on assessing and understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Data understanding:
- Starts with an initial data collection.
- The data collected is then described and explored (e.g. the target attribute of a prediction task is identified).
- Then the data quality is verified (e.g. noise or missing values).

Data preparation: Covers all activities to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. The data to be used for analysis is:
- Selected
- Cleaned (its quality is raised to the level required by the analysis technique)
- Constructed (e.g. derived attributes like area = length * breadth are created)
- Integrated (information from multiple tables is combined to create new labels)
- Formatted

Modeling:
- Specialized modeling techniques are selected (e.g. a decision tree with the C4.5 algorithm).
- A test design is generated to test the model's quality and validity.
- The modeling tool is run on the prepared data set.
- The model is assessed and evaluated (its accuracy is tested).

Evaluation:
- The degree to which the model meets the business objectives is assessed.
- The model undergoes a review process identifying the objectives missed or accomplished; based on this, it is determined whether the project should be deployed.

Deployment: Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. A deployment plan is drawn up before the deployment is actually carried out.

Table 2: Phases in CRISP-DM Process Reference Model

2.2 XML Standards (Model-Defining Standards)

PMML

PMML stands for the Predictive Model Markup Language. It is being developed by the Data Mining Group [dmg], a vendor-led consortium which currently includes over a dozen members, among them Angoss, IBM, Magnify, MINEit, Microsoft, the National Center for Data Mining at the University of Illinois (Chicago), Oracle, NCR, Salford Systems, SPSS, SAS, and Xchange. PMML is used to specify the models. The latest version, PMML 2.1, was released in March. There have been six releases so far.

Motivation:
A standard representation for data mining and statistical models was required. In addition, the representation needed to be relatively narrow so that it could serve as common ground for several subsequent standards, allowing those standards to interoperate.

Standard Description:

PMML is an XML markup language which provides a way for applications to define statistical and data mining models and to share models between PMML-compliant applications. It allows users to develop models within one vendor's application, and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. It describes the inputs to data mining models, the transformations used to prepare data for data mining, and the parameters which define the models themselves. [PMMSche] [stdhb] PMML consists of the components summarized in Table 3.

Data Dictionary: The data dictionary contains data definitions that do not vary with the model.
- Defines the attributes input to models.
- Specifies the type and value range for each attribute.

Mining Schema: The mining schema contains information that is specific to a certain model and varies with the model. Each model contains one mining schema that lists the fields used in the model. These fields are a subset of the fields in the Data Dictionary. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).

Transformation Dictionary: Defines derived fields. Derived fields may be defined by:
- Normalization, which maps continuous or discrete values to numbers
- Discretization, which maps continuous values to discrete values
- Value mapping, which maps discrete values to discrete values
- Aggregation, which summarizes or collects groups of values, e.g. by computing averages

Model Statistics: The Model Statistics component contains basic univariate statistics about the model, such as the minimum, maximum, mean, standard deviation and median of numerical attributes.

Model Parameters: PMML also specifies the actual parameters defining the statistical and data mining models themselves. The models supported in Version 2.1 are: regression models, cluster models, trees, neural networks, Bayesian models, association rules and sequence models.

Mining Functions: Different models, such as neural networks and logistic regression, can be used for different purposes; some instances implement prediction of numeric values, while others can be used for classification. Therefore, PMML Version 2.1 defines five different mining functions: association rules, sequences, classification, regression and clustering.

Table 3: PMML Components of Data Mining Model

Since PMML is an XML-based standard, the specification comes in the form of an XML Document Type Definition (DTD). A PMML document can contain more than one model. If the application system provides a means of selecting models by name and the PMML consumer specifies a model name, then that model is used; otherwise the first model is used. Please see Appendix A1 for an example of PMML. [stdhb]

Interoperability with other standards:
PMML is complementary to many other data mining standards. Its XML interchange format is supported by several other standards, such as XML for Analysis, JSR 73, and SQL/MM Part 6: Data Mining. PMML provides applications with a vendor-independent method of defining models, so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications.

CWM-DM

CWM-DM stands for Common Warehouse Metamodel for Data Mining. It was specified by members of the JDM expert group and has many elements in common with JDM. It is a new specification for data mining metadata and has recently been defined using the Common Warehouse Metamodel (CWM) specification from the Object Management Group (OMG).

Motivation:
Different data warehousing solutions, including data mining solutions, should be provided transparently to applications through a unified metadata management environment. Metadata not only links individual software components provided by one software vendor, but also has the potential to open a data warehousing platform from one provider to third-party analytic tools and applications. The Common Warehouse Metamodel is a specification that describes metadata interchange among data warehousing, business intelligence, knowledge management and portal technologies. The OMG Meta-Object Facility (MOF) bridges the gap between dissimilar metamodels by providing a common basis for metamodels. If two different metamodels are both MOF-conformant, then models based on them can reside in the same repository.

Standard Description:
CWM-DM consists of the conceptual areas summarized in Table 4. CWM-DM also defines tasks that associate the inputs to mining operations, such as build, test, and apply (score).
[CurrPaYa]

Model description: This consists of:
- MiningModel, a representation of the mining model itself
- MiningSettings, which drive the construction of the model
- ApplicationInputSpecification, which specifies the set of input attributes for the model
- MiningModelResult, which represents the result set produced by the testing or application of a generated model

Settings: MiningSettings has four subclasses, representing settings for:
- StatisticsSettings
- ClusteringSettings
- SupervisedMiningSettings
- AssociationRulesSettings
The Settings area represents the mining settings of the data mining algorithms at the function level, including specific mining attributes.

Attributes: The Attributes area defines the data mining attributes and has MiningAttribute as its basic class.

Table 4: CWM-DM conceptual areas

Interoperability with other standards:
CWM supports interoperability among data warehouse vendors by defining Document Type Definitions (DTDs) that standardize the XML metadata interchanged between data warehouses. The CWM standard generates the DTDs in three steps:
1. A model is created using the Unified Modeling Language (UML).
2. The UML model is used to generate a CWM interchange format called the Meta-Object Facility / XML Metadata Interchange (MOF/XML).
3. The MOF/XML is converted automatically to DTDs.

2.3 Web Standards

With the expansion of the World Wide Web, the web has become one of the largest repositories of data. Hence the data to be mined may be distributed and may need to be accessed via the web.

XMLA

Microsoft and Hyperion introduced XML for Analysis (XMLA), a Simple Object Access Protocol (SOAP)-based XML API designed for standardizing data access between a web client application and an analytic data provider, such as an OLAP or data mining application. The XMLA API supports the exchange of analytical data between clients and servers on any platform and with any language. [xmla]

Motivation:
Under traditional data access techniques, such as OLE DB and ODBC, a client component that is tightly coupled to the data provider server must be installed on the client machine before an application can access data from a data provider. Tightly coupled client components can create dependencies on a specific hardware platform, a specific operating system, a specific interface model, a specific programming language, and a specific match between versions of client and server components. The requirement to install client components and the dependencies associated with tightly coupled architectures are unsuitable for the loosely coupled, stateless, cross-platform, and language-independent environment of the Internet. To provide reliable data access to Web applications, the Internet, mobile devices, and cross-platform desktops need a standard methodology that does not require component downloads to the client. Extensible Markup Language (XML) is generic and can be universally accessed. XML for Analysis advances the concepts of OLE DB by providing standardized universal data access to any standard data source residing on the Web, without the need to deploy a client component that exposes COM interfaces. XML for Analysis is optimized for the Web by minimizing round trips to the server and targeting stateless client requests to maximize the scalability and robustness of a data source. [kddxml]

Standard Description:
XMLA, an XML-based communication API, defines two methods, Discover and Execute, which consume and send XML for stateless data discovery and manipulation. The two methods are summarized in Table 5.

Discover: Used to obtain information (e.g. a list of available data sources) and metadata from Web services. The data retrieved with the Discover method depends on the values of the parameters passed to it.
  Syntax:
    Discover (
      [in] RequestType As EnumString,
      [in] Restrictions As Restrictions,
      [in] Properties As Properties,
      [out] Resultset As Rowset)
  - RequestType: Determines the type of information to be returned.
  - Restrictions: Enables the user to restrict the data returned in Resultset.
  - Properties: Enables the user to control some aspect of the Discover method, such as defining the connection string, specifying the return format of the result set, and specifying the locale in which the data should be formatted. The available properties and their values can be obtained by using the DISCOVER_PROPERTIES request type with the Discover method.
  - Resultset: This required parameter contains the result set returned by the provider as a Rowset object.

Execute: Used for sending action requests to the server. This includes requests involving data transfer, such as retrieving or updating data on the server.
  Syntax:
    Execute (
      [in] Command As Command,
      [in] Properties As Properties,
      [out] ResultSet As ResultSet)
  - Command: Consists of a provider-specific statement to be executed. For example, this parameter contains a <Statement> tag that contains an SQL command or query.
  - Properties: Each property allows the user to control some aspect of the Execute method, such as defining the connection string, specifying the return format of the result set, or specifying the locale in which the data should be formatted.
  - ResultSet: This required parameter contains the result set returned by the provider as a Rowset object.

The Discover and Execute methods enable users to determine what can be queried on a particular server and, based on this, to submit commands to be executed.

Example: A client that has the URL of a server hosting a Web service sends Discover and Execute calls to the server using the SOAP and HTTP protocols. The server instantiates the XMLA provider, which handles the Discover and Execute calls. The XMLA provider fetches the data, packages it into XML, and then sends the requested data as XML to the client.

Table 5: XMLA APIs

See Appendix A2 for a detailed example of XMLA.

Interoperability with other standards:
The XMLA specification is built upon the open Internet standards of HTTP, XML, and SOAP, and is not bound to any specific language or technology.

Semantic Web

The World Wide Web Consortium (W3C) standards for the semantic web define a general structure for knowledge using XML, RDF, and ontologies [W3C SW]. The semantic web approach develops languages for expressing information in machine-processable form. The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners, and is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming. This infrastructure can in principle be used to store the knowledge extracted from data using data mining systems, although at present one could argue that this is more of a goal than an achievement. As an example of the type of knowledge that can be stored in the semantic web, RDF can be used to code assertions such as "credit transactions with a dollar amount of $1 at merchants with a MCC code of 542 have a 30% likelihood of being fraudulent."
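To make the fraud assertion above more concrete, the sketch below encodes it as subject-predicate-object triples and prints them in an N-Triples-like form. It is an illustration only: the namespace `http://example.org/fraud#`, the property names and the helper functions are hypothetical, and plain Python tuples stand in for a real RDF library.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace; a real vocabulary would be defined by an ontology.
EX = "http://example.org/fraud#"


def fraud_rule_triples(rule_id: str, amount_usd: str, mcc: str, likelihood: str) -> List[Triple]:
    """Encode one mined fraud rule as three triples about a rule resource."""
    subject = f"<{EX}{rule_id}>"
    return [
        (subject, f"<{EX}transactionAmountUSD>", f'"{amount_usd}"'),
        (subject, f"<{EX}merchantCategoryCode>", f'"{mcc}"'),
        (subject, f"<{EX}fraudLikelihood>", f'"{likelihood}"'),
    ]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = fraud_rule_triples("rule1", "1", "542", "0.30")
print(to_ntriples(triples))
```

Each line of output is one statement about the rule resource, which is the granularity at which RDF stores knowledge.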
[stdhb]

Data Space

Data Space is an infrastructure for creating a web of data, or data webs. General web operations involve browsing remote pages or documents, whereas the main purpose of a data space is to explore and mine remote columns of distributed data. Data webs are similar to semantic webs except that they house data instead of documents.

Motivation:

The web today contains a large amount of data. Although the amount of scientific, health care and business data is exploding, we do not have the technology today to casually explore remote data or to mine distributed data. [stdhb] The size of individual data sets has also increased. Several issues arise in analyzing such data. The multimedia documents on the web cannot be used directly for mining and analysis. Another issue is that the current web structure does not optimally support the handling of large data sets and is best suited only to browsing hypertext documents. [rdsw] Hence there is a need for standard support for this data. The concept of a data space helps explore, analyze and mine such data.

Standard Description:
The DataSpace project is supported by the National Science Foundation and has Robert Grossman as its director. DataSpace is built around standards developed by the Data Mining Group and W3C. The concept of a data space is based upon XML and web services, which are W3C-maintained standards. DataSpace defines a protocol, DSTP (DataSpace Transfer Protocol), for distribution, enquiry and retrieval of data in a DataSpace. It also works with the real-time scoring standard PSUP (Predictive Scoring and Update Protocol). [Dsw]

The DataSpace consists of the components shown in Figure 2.

[Figure 2: DataSpace Architecture. Labels in the figure: Data Web; DSTP; PSUP for real time; Open Source Server; Open Source Client; Data Mining Engine; Access Remote Data; View and Mine Data; PMML.]

DSTP is the protocol for the distribution, enquiry and retrieval of data in a DataSpace. The data could be stored in files, databases or distributed databases. The data has a corresponding XML file, which contains Universal Correlation Key (UCK) tags that act as identification keys. The UCK is similar to a primary key in a database. A join can be performed by merging data from different servers on the basis of UCKs. [DSTP]

The Predictive Scoring and Update Protocol (PSUP) is a protocol for event-driven, real-time scoring. Real-time applications are becoming increasingly important in business, e-business, and health care. PSUP provides the ability to use PMML applications in real-time and near-real-time applications.

For the purpose of data mining, a DSTP client is used to access the remote data. The data is retrieved from the required sites. DataSpace is designed to interoperate with proprietary and open source data mining tools. In particular, the open source statistical package R has been integrated into Version 1.1 of DataSpace and is currently being integrated into Version 2.0. DataSpace also works with predictive models in PMML, the XML markup language for statistical and data mining models.

DSTP: Provides direct support for attributes, keys and metadata. Also supports:
- Attribute selection
- Range queries
- Sampling
- Other functions for accessing and analyzing remote data

PSUP: A protocol for event-driven, real-time scoring. PSUP provides the ability to use PMML in real-time applications.

Table 6: Summary of Data Space Standards

2.4 Application Programming Interfaces (APIs)

Earlier, application developers wrote their own data mining algorithms for applications, or used sophisticated end-user GUIs. The GUI packages for data mining included a complete range of methods for data transformation, model building, testing and scoring. However, it remained challenging to integrate data mining with application code because proper APIs for the task were lacking. The APIs that existed were vendor-specific and hence proprietary. A product built on them therefore became dependent on a single vendor and hence risky to market. To switch to a different vendor's solution, the entire code had to be rewritten, which made the process costly. In short, it was realized that data mining solutions must co-exist. Hence the need arose for a common standard for the APIs. The ability to leverage data mining functionality via a standard API greatly reduces risk and potential cost. With a standard API, customers can use multiple products for solving business problems by applying the most appropriate algorithm implementation without investing resources to learn each vendor's proprietary API. Moreover, a standard API makes data mining more accessible to developers while making developer skills more transferable. Vendors can now differentiate themselves on price, performance, accuracy, and features. [JDM]
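The benefit of a standard API can be sketched in a few lines: application code is written once against a common interface, and vendor implementations are swapped without rewriting it. Everything below is a hypothetical illustration; the `Classifier` interface and the two toy "vendor" classes are not taken from SQL/MM, JDM or OLE DB for Data Mining.

```python
from collections import Counter
from typing import List, Protocol, Sequence


class Classifier(Protocol):
    """Hypothetical vendor-neutral interface: build a model, then apply it."""

    def build(self, rows: Sequence[Sequence[float]], labels: Sequence[str]) -> None: ...

    def apply(self, row: Sequence[float]) -> str: ...


class MajorityClassVendor:
    """Toy vendor A: always predicts the most frequent training label."""

    def build(self, rows: Sequence[Sequence[float]], labels: Sequence[str]) -> None:
        self._label = Counter(labels).most_common(1)[0][0]

    def apply(self, row: Sequence[float]) -> str:
        return self._label


class NearestNeighbourVendor:
    """Toy vendor B: predicts the label of the closest training row."""

    def build(self, rows: Sequence[Sequence[float]], labels: Sequence[str]) -> None:
        self._rows = [list(r) for r in rows]
        self._labels = list(labels)

    def apply(self, row: Sequence[float]) -> str:
        # Squared Euclidean distance to every training row.
        distances = [sum((a - b) ** 2 for a, b in zip(r, row)) for r in self._rows]
        return self._labels[distances.index(min(distances))]


def score_all(model: Classifier, rows, labels, new_rows) -> List[str]:
    """Application code: depends only on the shared interface, not on a vendor."""
    model.build(rows, labels)
    return [model.apply(r) for r in new_rows]


rows = [[0.0], [1.0], [10.0]]
labels = ["low", "low", "high"]
print(score_all(MajorityClassVendor(), rows, labels, [[9.0]]))     # prints ['low']
print(score_all(NearestNeighbourVendor(), rows, labels, [[9.0]]))  # prints ['high']
```

The function `score_all` never changes when the vendor does, which is the cost and risk reduction the paragraph above describes; the vendors still differ in the accuracy of their answers.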

2.4.1 SQL/MM DM

SQL/MM is an ISO/IEC international standardization project. The SQL/MM suite of standards includes parts used to manage full-text data, spatial data, and still images. Part 6 of the standard addresses data mining.

Motivation:
Database systems should be able to integrate data mining applications in a standard way, so that end users can perform data mining with ease. Data mining has become a part of modern data management and can be regarded as a sophisticated tool for extracting information from, or aggregating, the original data. SQL is widely used by database users today and provides basic operations such as aggregation. Data mining can thus be seen as a natural extension of the primitive functionality provided by SQL, and it is natural to standardize data mining through SQL.

Standard Description:
The SQL/MM Part 6: Data Mining standard provides an API for data mining applications to access data from SQL/MM-compliant relational databases. It defines structured user-defined types, including associated methods, to support data mining. It attempts to provide a standardized interface to data mining algorithms that can be layered on top of any object-relational database system and even deployed as middleware when required.
[Sqlm]

The table below provides a brief description of the standard: [Sqlm][Cti]

Data Mining Techniques: Four data mining techniques are supported:
- Rule Model: Allows searching for patterns and relationships between different parts of the data.
- Clustering Model: Helps group the data into clusters.
- Regression Model: Helps predict the ranking of new data based upon the analysis of existing data.
- Classification Model: Helps predict the grouping or class of new data.

Data Mining Stages: There are three distinct stages through which data can be mined:
- Train: Choose the most appropriate technique, set parameters to orient the model, and train by applying reasonably sized data.
- Test: For classification and regression, test with known data and compare the model's predictions.
- Apply: Apply the model to the business data.

Supporting Data Types:
- DM_*Model: Defines the model to be used when mining the data.
- DM_*Settings: Stores various parameters of the data mining model, e.g. the depth of a decision tree or the maximum number of clusters.
- DM_*Result: Created by running the data mining model against real data.
- DM_*TestResult: Holds the results of testing during the training phase of the data mining models.
- DM_*Task: Stores the metadata that describes the process and control of the testing and of the actual runs.
where * can be:
- Clas: Classification Model
- Rule: Rule Model
- Clustering: Clustering Model
- Regression: Regression Model

Table 7: Summary of SQL/MM DM Standard

Java APIs

Java Specification Request 73 (JSR-73), also known as Java Data Mining (JDM), defines a pure Java API to support data mining operations. The JDM development team was led by Oracle and included members such as Hyperion, IBM, and Sun Microsystems.

Motivation:
Java has become a language that is widely used by application developers. The Java 2 Platform, Enterprise Edition (J2EE) provides a standard development and deployment environment for enterprise applications. It reduces the cost and complexity of developing multi-tier enterprise services by defining a standard, platform-independent architecture for building enterprise components. JSR-73 provides a standard way to create, store, access and maintain data and metadata supporting data mining models, data scoring and data mining results, serving J2EE-compliant application servers. It provides a single standard API for data mining systems that will be understood by a wide variety of client applications and components running on the J2EE platform. This specification does not preclude, however, the use of JDM services outside of the J2EE environment.

Standard Description: Defining compliance for vendor implementations requires addressing several issues. In JDM, data mining includes the functional areas of classification, regression, attribute importance, clustering and association. These are supported by supervised and unsupervised algorithms such as decision trees, neural networks, Naïve Bayes, Support Vector Machines and K-means on structured data. A particular implementation of this specification may not necessarily support all interfaces and services provided by JDM.

JDM is based on a generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such as the Object Management Group's Common Warehouse Metamodel (CWM), ISO's SQL/MM for Data Mining, and the Data Mining Group's Predictive Model Markup Language (PMML), as appropriate. Implementation details of JDM are delegated to each vendor. A vendor may decide to implement JDM as a native API of its data mining product. Others may opt to develop a driver/adapter that mediates between a core JDM layer and multiple vendor products. The JDM specification does not prescribe a particular implementation strategy, nor does it prescribe performance or accuracy of a given capability or algorithm.

To ensure J2EE compatibility and eliminate duplication of effort, JDM leverages existing specifications. In particular, JDM leverages the Java Connection Architecture to provide communication and resource management between applications and the services that implement the JDM API. JDM also reflects aspects of the Java Metadata Interface. [JDM]

The standard is summarized below under architectural components, data mining functions and data mining tasks.

Architectural Components: JDM has 3 logical components:
- Application Programming Interface: the end-user visible component of a JDM implementation that allows access to the services provided by the data mining engine.
An application developer needs knowledge of only this library.
- Data Mining Engine: provides the infrastructure that offers a set of data mining services to the API clients.
- Metadata repository: serves to persist data mining objects. The repository can be based on the CWM framework.

JDM specifies the following data mining functions:
- Classification: analyzes the input or build data and predicts to which class a given case belongs.
- Regression: predicts a continuous, numerical valued target attribute given a set of predictors.
- Attribute Importance: determines which attributes are most important for building a model. This helps users reduce the model build time, scoring time, etc. It is similar to feature selection.
- Clustering: finds clusters embedded in the data, where a cluster is a collection of data objects similar to one another.
- Association: discovers relationships or correlations among a set of items. It has been used in market basket analysis and in the analysis of customer behavior.

Data mining revolves around a few common data mining tasks:
- Building a Model: users define input tasks specifying the parameters model name, mining data and mining settings. JDM enables users to build models in the functional areas of classification, regression, attribute importance, clustering and association.

- Testing a Model: gives an estimate of the accuracy a model has in predicting the target. Testing follows model building and computes the accuracy of a model's predictions when the model is applied to a previously unseen data set. The input consists of the model and the data for testing it. Test results could be a confusion matrix, error estimates, etc. Lift is a measure of the effectiveness of a predictive model; a user may request that lift be computed.
- Applying a Model: the model is finally applied to a case and produces one or more predictions or assignments.
- Object Import and Export: JDM enables object import and export, which could be useful for:
  - interchange with other data mining engines (DMEs)
  - persistent storage outside the DME
  - object inspection or manipulation
  To enable import and export of system metadata, JDM specifies 2 standards for defining metadata in XML: PMML for mining models, and CWM.
- Computing statistics on data: provides the ability to compute various statistics on a given physical data set.
- Verifying task correctness

Extension Packages:
- javax.datamining
- javax.datamining.settings
- javax.datamining.models
- javax.datamining.transformations
- javax.datamining.results

Conformance Statement: The JDM API standard is flexible and allows vendors to implement only the specific functions that they want their product to support. Packages are divided into 2 categories:
- Required: vendors must provide an implementation for these.
- Optional: a vendor may or may not implement these.

Table 8: Summary of Java Data Model Standards

Microsoft OLEDB-DM

In July 2001 Microsoft released the specification document [3] for the first real industrial standard for data mining, called OLE DB for Data Mining. This API is supported by Microsoft and is part of the release of Microsoft SQL Server 2000 (Analysis Server component). See Appendix A3 for an overview of OLEDB.

Motivation: An industry standard was required for data mining so that different data mining algorithms from various data mining ISVs can be easily plugged into user applications.
OLEDB-DM addressed the problem of deploying models (once the model is generated, how to store, maintain, and refresh it as data in the warehouse is updated, how to

programmatically use the model to do predictions on other data sets, and how to browse models over the life cycle of an enterprise). Another motivation for introducing OLE DB DM was to enable enterprise application developers to participate in building data mining solutions. For this it was required that the infrastructure supporting data mining solutions be aligned with the traditional database development environment and with APIs for database access.

Standard Description: OLE DB for DM is an OLE DB extension that supports data mining operations over OLE DB data providers. It has the concepts of:
- Data mining providers: software packages that provide data mining algorithms.
- Data mining consumers: applications that use data mining features.

OLE DB for DM specifies the API between data mining consumers and data mining providers. It introduces two new concepts, cases and models, into the current semantics of OLEDB.
- CaseSets: Input data is in the form of a set of cases (caseset). A case captures the traditional view of an observation by machine learning algorithms as consisting of all information known about a basic entity being analyzed for mining, as opposed to the normalized tables stored in databases. It makes use of the concept of nested tables for this.
- Data mining model (DMM): It is treated as if it were a special type of table. A caseset and additional meta-information are associated with a DMM when it is created (defined). When data (in the form of cases) is inserted into the data mining model, a mining algorithm processes it and the resulting abstraction (or DMM) is saved instead of the data itself. Once a DMM is populated, it can be used for prediction, or its content can be browsed for reporting.

The key operations to support on a data mining model are shown in Table 9. This model has the advantage of a low cost of deployment. See Appendix A4 for an example.
Operations on DMM:
- Define
  Description: Identifying the set of attributes of data to be predicted and to be used for prediction, and the algorithm used to build the mining model.
  Syntax: CREATE statement.
- Populate
  Description: Populating a mining model from training data using the algorithm specified in its definition above.
  Syntax: Repeatedly via the INSERT INTO statement (used to add rows in a SQL table), and emptied (reset) via the DELETE statement.
- Predict
  Description: Predicting attributes for new data using a mining model that has been populated.
  Syntax: Prediction on a dataset is made by a PREDICTION JOIN between the mining model and the data set.
- Browse
  Description: Browsing a mining model for reporting and visualization applications.
  Syntax: Using the SELECT statement.

Table 9: DMM Operations
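The four operations in Table 9 can be illustrated with a small sketch. It is not OLE DB for DM syntax and does not use any Microsoft API; the class name ToyMiningModel, its method names, and the column names are hypothetical, and the "algorithm" is a simple frequency count chosen only to show that a populated model keeps a learned summary rather than the inserted cases.

```python
from collections import Counter, defaultdict


class ToyMiningModel:
    """Hypothetical stand-in for a DMM: stores a learned summary, not the cases."""

    def __init__(self, input_column, predict_column):
        # Define: name the attribute used for prediction and the one to predict.
        self.input_column = input_column
        self.predict_column = predict_column
        self._counts = defaultdict(Counter)

    def insert(self, cases):
        # Populate: each case updates the counts; the case itself is not kept.
        for case in cases:
            self._counts[case[self.input_column]][case[self.predict_column]] += 1

    def delete(self):
        # Reset the model to its defined-but-unpopulated state.
        self._counts.clear()

    def predict(self, case):
        # Predict: most frequent target value seen for this input value.
        seen = self._counts.get(case[self.input_column])
        return seen.most_common(1)[0][0] if seen else None

    def browse(self):
        # Browse: expose the learned content for reporting.
        return {key: dict(value) for key, value in self._counts.items()}


model = ToyMiningModel("age_group", "buys_dessert")
model.insert([
    {"age_group": "child", "buys_dessert": "yes"},
    {"age_group": "child", "buys_dessert": "yes"},
    {"age_group": "adult", "buys_dessert": "no"},
])
prediction = model.predict({"age_group": "child"})
content = model.browse()
```

In this sketch, insert plays the role of INSERT INTO, delete the role of DELETE, predict the role of a PREDICTION JOIN against new data, and browse the role of a SELECT over the model content.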

Interoperability with other standards: OLE DB for DM is independent of any particular provider or software and is meant to establish a uniform API. It is not specialized to any specific mining model but is structured to cater to all well-known mining models. [MSOLE]

2.5 Grid Services

Grids are collections of computers or computer networks, connected in a way that allows for sharing of processing power and storage as well as applications and data. Grid technologies and infrastructures are hence defined as supporting the sharing and coordinated use of diverse resources in dynamic, distributed virtual organizations. [grid]

OGSA and data mining

The Open Grid Services Architecture (OGSA) represents an evolution towards a Grid architecture based on Web services concepts and technologies. It consists of a well-defined set of basic interfaces and emphasizes extensibility, vendor neutrality, and commitment to a community standardization process. It uses the Web Services Description Language (WSDL) to achieve self-describing, discoverable services and interoperable protocols, with extensions to support multiple coordinated interfaces and change management.

Motivation: In a distributed environment, it is important to employ mechanisms that help in communicating interoperably. A service-oriented view partitions this interoperability problem into two subproblems:
- Definition of service interfaces and identification of the protocol(s) that can be used to invoke a particular interface
- Agreement on a standard set of such protocols

A service-oriented view allows local/remote transparency, adaptation to local OS services, and uniform service semantics.
A service-oriented view also simplifies the encapsulation of diverse implementations behind a common interface. This allows for consistent resource access across multiple heterogeneous platforms with local or remote location transparency, and enables the mapping of multiple logical resource instances onto the same physical resource and the management of resources. Thus service definition is decoupled from service invocation.

OGSA describes and defines a service-oriented architecture composed of a set of interfaces and their corresponding behaviors to facilitate distributed resource sharing and access in heterogeneous dynamic environments.

Data is inherently distributed, and hence the data mining task needs to be performed keeping this distributed environment in mind. It is also required to provide data mining as a service. Grid technology provides secure, reliable and scalable high-bandwidth access to distributed data sources across various administrative domains, which can be exploited.

Standard Description:

[Figure 3: Service oriented architecture. Components: Service Requester, Service Provider, Service Directory, Transport Medium. Operations: Publish, Find, Bind.]

Figure 3 shows the individual components of the service-oriented architecture (SOA). The service directory is the location where information about all available grid services is maintained. A service provider that wants to offer services publishes them by putting appropriate entries into the service directory. A service requester uses the service directory to find an appropriate service that matches its requirements.

An example of a data mining scenario using this architecture is as follows. When a service requester locates a suitable data mining service, it binds to the service provider, using binding information maintained in the service directory. The binding information contains the specification of the protocol that the service requester must use as well as the structure of the request messages and the resulting responses. The communication between the various agents occurs via an appropriate transport mechanism.

The Grid offers basic services that include resource allocation and process management, unicast and multicast communication services, security services, status monitoring, remote data access, etc. Apart from this there is the Data Grid, which provides GridFTP (a secure, robust and efficient data transfer protocol) and a metadata information management system. Hence grid-provided functions, e.g. single sign-on security, the ability to execute jobs at multiple remote sites, the ability to securely move data between sites, a broker to determine the best place to execute a mining job, and a job manager to control mining jobs, do not have to be re-implemented for each new mining system. Mining system developers can therefore focus on the mining applications and not on the issues associated with distributed processing. However, the standards for these are yet to be developed.

Interoperability with other standards:

The standard for Grid Services is yet to emerge.

3. Developing Data Mining Application Using Data Mining Standards

In this section we describe a data mining application. We then describe its architecture using data mining standards. However, we see that not all the architecture constructs can be standardized, as no standards are available for them. We point this out in more detail in Section 4 below.

3.1 Application Requirement Specification

A multinational food chain has outlets in several countries, namely India, the USA and China. The outlets in each of these countries want information regarding:
- Combinations of food items that constitute their happy meal.
- Most preferred food items they need to target for their advertisements in the respective country.
- Preferred seasonal food items.
- The food items, their prices and their popularity, together with patterns that reveal the relationship between pricing and popularity.

All the customer transactions of each outlet are recorded. Each transaction contains, along with the customer id, the food items, their prices and the time at which the order was placed. The above information must be obtained solely from these transactions, as the food chain company does not want to conduct any surveys. However, the outlets may store their transactions in different databases, such as Oracle or Sybase.

As we see, this is a typical data mining application. In the next section we describe the run-time architecture of the data mining system. We also see how the application of standards makes the components of this architecture independent of each other as well as of the underlying platform or technology.

3.2 Design and Deployment

Architecture Overview: In the architecture shown in Figure 4, the outlets (data sources) are spread over multiple locations (Location A, B, C), henceforth referred to as remote data sources. Before being mined, the data has to be aggregated in a single location. For this we use a client-server architecture.
Each of the remote data sources has a data server, which might connect to the respective database using any standard driver. A client is deployed in the location where the data to be mined is collected. This client contacts the servers for browsing or retrieving data. As indicated in the figure, we need a standard for data transport over the web so that this entire client-server architecture can be independently developed and deployed.

[Figure 4: Architecture of Data Mining Application. Components: Data Servers at Location A, Location B and Location C; Client and Data Warehouse at the location where the data mining task is performed; Driver; Data Mining Engines; Mining Models; Application; Output. Labelled standardization points: 1) Standard for data transport over the web, 2) Data connectivity, 3) Standard model representation, 4) Standard API, 5) Standard for data cleaning and transformation, 6) Standard for representing decision.]

The client stores the data in a data warehouse so that data mining operations can be performed on it. Before the data is mined, however, it needs to be cleaned and transformed. Some standards should be present for this purpose.

The data mining engine accesses the data in the warehouse with the help of standard data connectivity mechanisms. It produces a mining model, such as a decision tree. This model is then used to discover patterns in the data. The model produced must be represented in a standard format so as to allow interoperability across vendors as well as across different data mining engines. Hence a standard is required for model representation.
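The value of a standard model representation can be shown with a minimal sketch in which one function writes a model as XML and an unrelated function reads and applies it. The element and attribute names (TreeModel, Node, SimplePredicate, lessThan, greaterOrEqual) follow PMML conventions, but the document is deliberately incomplete (it omits the header, data dictionary and mining schema) and is not validated against the PMML schema. The function names, the field name "price" and the labels are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def export_tree(threshold, below_label, above_label):
    # Producer side: write a one-split decision tree as PMML-style XML.
    pmml = ET.Element("PMML")
    tree = ET.SubElement(pmml, "TreeModel", functionName="classification")
    root = ET.SubElement(tree, "Node")
    ET.SubElement(root, "True")
    low = ET.SubElement(root, "Node", score=below_label)
    ET.SubElement(low, "SimplePredicate", field="price",
                  operator="lessThan", value=str(threshold))
    high = ET.SubElement(root, "Node", score=above_label)
    ET.SubElement(high, "SimplePredicate", field="price",
                  operator="greaterOrEqual", value=str(threshold))
    return ET.tostring(pmml, encoding="unicode")


def score(document, record):
    # Consumer side: a different "engine" parses the XML and applies the tree.
    operators = {
        "lessThan": lambda a, b: a < b,
        "greaterOrEqual": lambda a, b: a >= b,
    }
    root = ET.fromstring(document).find("TreeModel/Node")
    for child in root.findall("Node"):
        test = child.find("SimplePredicate")
        compare = operators[test.get("operator")]
        if compare(record[test.get("field")], float(test.get("value"))):
            return child.get("score")
    return None


document = export_tree(5.0, "popular", "niche")
cheap_item = score(document, {"price": 3.5})
costly_item = score(document, {"price": 9.0})
```

The consumer never sees the producer's internal data structures; it depends only on the agreed XML vocabulary, which is the property the architecture relies on when engines from different vendors exchange models.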

The data mining engine is accessible to the end user via an application programming interface. The application requiring data mining contains the calls to the API. This set of APIs should be standardized so that the application can switch to a different vendor's solution without its entire code having to be changed.

Also, once the data mining task is performed, the output produced needs to be incorporated into the existing business model. Hence the decisions or suggestions recommended by the data mining model need to be stored. For this a standardized decision model is required that integrates these decisions with the current business model.

Standards employed in the architecture:
- For data transport over the web, the standard DSTP [Section 2.3.3] is employed.
- The mining model produced by the data mining engine is PMML [Section 2.2.1] compliant so as to enable interoperability. If not PMML, then any model that conforms to the metamodel specifications of CWM-DM [Section 2.2.2] must be used. However, the most widely used model currently is PMML.
- The data mining engine connects to the data warehouse using either JDBC or ODBC drivers. Here we use a JDBC driver.
- The application uses the data mining services with the help of the standard API JSR-73 [Section 2.4.2].
- The entire system should be developed using the process standard CRISP-DM [Section 2.1.1].
- If we want this data mining application to be deployed as a web service, we can use a provider server at this end that supports XMLA's Execute and Discover APIs [Section 2.3.1]. Thus any third party can fire queries without having any software installed at its end.

Standards not yet defined: As we see, there are no current standards that can be used for data transformation. Also there is no standard decision model that could incorporate the output of a data mining task into our engine. We discuss this further in Section 4.
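The point made earlier in this section, that a standardized API lets the application switch vendors without changing its code, can be sketched as follows. The interface MiningEngine and both vendor classes are hypothetical and are not the JSR-73 interfaces; they only show the application depending on a shared contract. The task, finding item pairs that are frequently bought together, is taken from the happy meal requirement in Section 3.1.

```python
from abc import ABC, abstractmethod
from collections import Counter
from itertools import combinations


class MiningEngine(ABC):
    """Hypothetical vendor-neutral interface; not the JSR-73 API."""

    @abstractmethod
    def frequent_pairs(self, transactions, min_count):
        """Return item pairs bought together at least min_count times."""


class VendorAEngine(MiningEngine):
    def frequent_pairs(self, transactions, min_count):
        # Count every pair in a single pass over the transactions.
        counts = Counter()
        for items in transactions:
            counts.update(combinations(sorted(set(items)), 2))
        return {pair for pair, n in counts.items() if n >= min_count}


class VendorBEngine(MiningEngine):
    def frequent_pairs(self, transactions, min_count):
        # Different internals, same contract: test each candidate pair in turn.
        baskets = [set(t) for t in transactions]
        items = sorted(set().union(*baskets))
        return {
            (a, b)
            for a, b in combinations(items, 2)
            if sum(1 for basket in baskets if a in basket and b in basket) >= min_count
        }


def happy_meal_candidates(engine, transactions):
    # Application code: depends only on the interface, never on a vendor class.
    return engine.frequent_pairs(transactions, min_count=2)


orders = [
    ["burger", "fries", "cola"],
    ["burger", "fries"],
    ["cola", "salad"],
]
from_vendor_a = happy_meal_candidates(VendorAEngine(), orders)
from_vendor_b = happy_meal_candidates(VendorBEngine(), orders)
```

Swapping VendorAEngine for VendorBEngine changes one constructor call and nothing in happy_meal_candidates, which is the kind of decoupling a standard API is meant to provide.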
Scoring should also be integrated with the mining applications via published standard APIs and run-time-library scoring engines. Automation of the scoring process will reduce processing time, allow the most up-to-date data to be used, and reduce error.

4. Analysis

Earlier, data mining comprised algorithms working on flat files with no standards. Industry interest led to the development of standards that enabled the representation of these algorithms in a model and the separation of online development of these models from their deployment. These