Description of Knowledge Discovery Tools in KDTML

Domenico Potena, Claudia Diamantini
Dipartimento di Ingegneria Informatica, Gestionale e dell'Automazione, Università Politecnica delle Marche, via Brecce Bianche, 60131 Ancona, Italy
{potena,diamanti}@diiga.univpm.it

Abstract

Knowledge Discovery in Databases (KDD) is a highly complex process in which many data manipulation tools with different characteristics have to be used together to reach the goal of extracting previously unknown, potentially useful information. The design of a KDD process implies the search for suitable tools, the understanding of their scope and proper use, their composition, and so on. All these activities can be supported by structured knowledge about the tools. This paper presents KDTML, a Knowledge Discovery Tool Markup Language for the annotation of tool characteristics, such as the tool location and execution environment, the I/O interface and the functionalities. As an example of the use of KDTML, we discuss the implementation of a wrapping service that automatically transforms a KDD tool written in any imperative language into a web service.

I. INTRODUCTION

The Knowledge Discovery in Databases (KDD) field is influenced by the pervasive spread of distributed computing environments and by theoretical and technical achievements in this area. As a matter of fact, the existence of different distributed data and computational resources raises both the opportunity and the demand for solutions to effectively exploit such resources.
In particular, the recent shift towards network organizations, the great variability of the tools which can be exploited in a KDD process, and the dynamism of the Data Mining field, where new algorithms and techniques are developed continuously, suggest viewing such tools as services to be found and used on the net, in a sort of open-market environment [1] where the user can look for implementations, suggestions, tool evaluations, examples of use, etc., and where distributed organizations can share their tools, data and results, thus minimizing re-implementation and enhancing reusability. On the other hand, integration and interoperability issues have to be faced in order to exploit heterogeneous tools developed by different authors, while support has to be given to users for the dynamic discovery of algorithms and data over the Internet [2], [3], [4]. Following the mainstream of the semantic web, the KDD field is thus observing the evolution of XML-based description languages, the definition of domain ontologies [5], [6, chap. 23] and the development of standards [7, chap. 19]. While proposals of languages for the description of KDD models and data exist [8], [9], to the best of our knowledge no attention has been devoted to the description of KDD tools. In the present paper we try to fill this gap by introducing KDTML, a Knowledge Discovery Tool Markup Language. This language describes the functionalities of a tool, its location, its execution environment and its I/O interface. By taking advantage of this structured knowledge, users can more easily discover new tools, understand their scope, design KDD processes by tool composition and so on. In particular, in the paper we show the exploitation of the KDTML description for the automatic wrapping of legacy KDD code in a web service environment.
This work is part of a more general project for the development of an extensible service oriented knowledge discovery support system, hereafter called Knowledge Discovery in Databases Virtual Mart (KDDVM) [3]. The rest of the paper is organized as follows: section II introduces the KDTML structure, while section III gives some details of the automatic wrapping service developed in the KDDVM project. Some concluding remarks are given in section IV.

II. THE KNOWLEDGE DISCOVERY TOOL MARKUP LANGUAGE

A process of Knowledge Discovery in Databases (KDD) is a highly interactive and iterative process of data manipulation aimed at the extraction of previously unknown, potentially useful information [10]. Iterativity and interactivity are inherent features of any discovery process, due to its intrinsic complexity and its goal-driven, domain-dependent nature. The complexity of the design of a KDD process is principally due both to the huge number of tools the user can choose from and to the expertise needed to face the various KDD tasks. Assuming a repository of KDD tools is available to the user, he/she has to pass through different activities in order to manage his/her KDD processes: he/she has to browse the tool repository and obtain information about the tools; to easily introduce new algorithms or releases into the repository; to choose the most suitable tools on the basis of a number of characteristics: tool performance (complexity, scalability, accuracy), the kind of data the tools can be used for (textual/symbolic data, numeric data, structured data, sequences, ...), the kind of goal the tools are written for (data cleaning, data transformation, data mining, visualization,
...), the kind of data mining task (classification, rule induction, ...); to prepare data conforming to the tool input format; to manage the execution of the tool, in particular to properly set tool parameters for the data and problem at hand; and to design the KDD process by tool composition. All the information needed to accomplish these tasks is codified in the Knowledge Discovery Tool Markup Language (KDTML). KDTML tool descriptions are XML documents whose tags describe the characteristics of tools and their interfaces. The general structure of a KDTML document is given by the DTD schema shown in Figure 1. The language is divided into four main sections: the first contains the information needed to locate and execute the tool, the second describes the tool I/O interface, the third lists the KDD software modules that can be linked up with the described tool, and the final section categorizes the KDD tool with respect to a KDD taxonomy.

Fig. 1. DTD schema of the KDTML.

As an illustrative example, throughout the paper we will use the KDTML description of the tool BVQ train, an implementation of the Bayes Vector Quantizer (BVQ) algorithm [11], a Data Mining algorithm used in classification tasks.

A. Location and execution

The first part of the KDTML document (see Fig. 2) describes a KDD tool by the tags name, language and description. Name is a simple datum giving the name of the KDD tool. The language element contains information on both the development and the test environment: the operating systems (os), the language in which the tool is written, and the compilers and options used to compile the software code. Finally, the name and path of the software source or executable (exe) code are given. In the example, the BVQ software is written in the C language and runs in a Linux environment.

<name>bvq_train</name>
<language>
  <name>c</name>
  <exe>
    <path>.\bvq.exe</path>
  </exe>
  <os>linux</os>
  <compiler>
    <name>gcc</name>
    <option>-g -lm -mieee</option>
  </compiler>
</language>

Fig. 2. First section of the KDTML document related to the BVQ.

B. Input/Output Interface

The second part of the document describes the I/O interface of a KDD tool. The input to a KDD tool can be either a call parameter or given interactively during the tool execution. An input datum is thus described by a variable number of parameters, that is, strings with optional descriptions, and by simple or structured data. Since input data can also be passed directly by file, the input element can further be characterized by a variable number of files, each described by its path name (in_path).

<input>
  <ob></ob>
  <type>int</type>
  <name>neurons_num</name>
  <bounds>
    <min>1</min>
  </bounds>
  number of training neurons
</input>

Fig. 3. Second section of the KDTML document related to the BVQ. A simple datum.
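A KDTML document with the structure described above is plain XML and can therefore be consumed with standard libraries. As a minimal sketch (in Python; the root element name tool, the underscore-separated element names and the helper function are our illustrative assumptions, not mandated by KDTML):

```python
import xml.etree.ElementTree as ET

# Illustrative KDTML fragment for the first section (location and
# execution); element names follow Fig. 2, path is a placeholder.
KDTML_DOC = """
<tool>
  <name>bvq_train</name>
  <language>
    <name>c</name>
    <exe><path>./bvq.exe</path></exe>
    <os>linux</os>
    <compiler>
      <name>gcc</name>
      <option>-g -lm -mieee</option>
    </compiler>
  </language>
</tool>
"""

def read_location_section(doc: str) -> dict:
    """Extract tool name, implementation language, OS, executable path
    and compiler settings from the first KDTML section."""
    root = ET.fromstring(doc)
    lang = root.find("language")
    return {
        "tool": root.findtext("name"),
        "language": lang.findtext("name"),
        "os": lang.findtext("os"),
        "exe_path": lang.findtext("exe/path"),
        "compiler": lang.findtext("compiler/name"),
        "compiler_options": lang.findtext("compiler/option"),
    }
```

A service wishing to locate and launch the tool would use the exe_path and os entries of the resulting dictionary.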
The input data can be equipped with an optional description, inserted in the tag description, that gives users more information about the meaning of the input. For example, the tag description can be used to explain how to set the value of the related input depending on the dataset to be analyzed. A simple datum has a name, a type and a set of optional features, such as the bounds (min and max), the description and the default value. Fig. 3 illustrates the simple datum neurons_num, an integer representing the number of code vectors of the vector quantizer. A structured datum is characterized by a name, a description and, recursively, a set of data. To represent and manage any input sequence, in the simple data definition as well as in the structured one we introduce the arguments ob, op and hidden of the input data. They describe, respectively, an obligatory input, an optional input and an input that the software needs but that is not given explicitly on the STDIN. The typical case is that of a file containing the training dataset. Even if the user typically supplies only the name of this file as the input to a KDD tool, the tool reads the content of the file according to a fixed, predefined structure, which has to be known by the user to format the file correctly. This information is provided by the txt_file_format element, where the structure of a record is represented by a string, according to the C language I/O format. The fragment of the BVQ KDTML document shown in Fig. 4 describes the input file data_in, which contains the training dataset to be analyzed. Note the use of the same name data_in to describe two different input data: a simple, string-type datum that represents the name and path of the input file, and a structured datum of hidden type that describes its format. In this way, these two input data are recognized as related to the same input (file) object.
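The txt_file_format string makes the dataset layout machine-readable. As an illustration (in Python; the reader function is ours, not part of KDTML), records conforming to the format (%f\t)*%i\n, i.e. tab-separated float features followed by an integer class label on each line, can be parsed as:

```python
def read_training_file(text: str):
    """Parse records of the form (%f\t)*%i\n: on each line, a run of
    tab-separated feature values (floats) followed by an integer
    class label; a newline separates one instance from another."""
    instances = []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        features = [float(v) for v in fields[:-1]]  # all but the last field
        label = int(fields[-1])                     # last field is the class
        instances.append((features, label))
    return instances
```

A tool consuming the KDTML description could generate such a reader automatically from the format string instead of hard-coding it.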
<input>
  <ob></ob>
  <type>string</type>
  <name>data_in</name>
  name and path of the dataset input file
  <structured>
    <hidden></hidden>
    <name>data_in</name>
    <txt_file_format>(%f\t)*%i\n
      <type>float</type>
      <name>feature</name>
      value of the feature
      <type>int</type>
      <name>class</name>
      value of the class
    </txt_file_format>
  </structured>
</input>

Fig. 4. Second section of the KDTML document related to the BVQ. An input dataset.

In the example, the format of the training file data_in is a list of values separated by a tabulator: the first n values are the feature values, while the last one represents the class. A newline separates one instance from another.

Differing from input data, output data can only be data or files (in the latter case, the tool returns the location of the files), represented respectively by the data and out_path tags. A description of the output data can also be given. The implementation chosen for the example returns to the user the name and path of the file containing the induced model (see Fig. 5).

C. Linkable modules

In order to achieve a KDD goal, the exploitation of a single tool is often not sufficient; rather, a KDD process is typically composed of many different tools, such that the output of one tool represents the input to another. In order to support the user in such a tool composition activity, a KDTML document reports optional information about the KDD tools that can be linked up with the described one: the linkable modules. The KDTML tags <in_module> and <out_module> indicate tools that can be executed before and after the given tool, respectively. In the example (see Fig. 6), the KDTML document of the BVQ algorithm states that the tool can be initialized by the SOM algorithm, while its output can be redirected either to an algorithm which extracts the Voronoi diagram or to the BVQFM, a feature extraction algorithm based on the BVQ internal structure.
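The linkable-modules information can support automated composition checks. A minimal sketch (in Python; the underscore element names, the fragment and the helper functions are our illustrative assumptions, and "admissible" here simply means "declared as an out module"):

```python
import xml.etree.ElementTree as ET

# Illustrative <link> fragment in the spirit of Fig. 6.
LINK_DOC = """
<link>
  <in_module>som</in_module>
  <out_module>bvqfm</out_module>
  <out_module>voronoi</out_module>
</link>
"""

def linkable(link_doc: str):
    """Return the sets of tool names declared as runnable before
    (in modules) and after (out modules) the described tool."""
    root = ET.fromstring(link_doc)
    before = {e.text for e in root.findall("in_module")}
    after = {e.text for e in root.findall("out_module")}
    return before, after

def can_chain(out_modules, next_tool: str) -> bool:
    """A pipeline step tool -> next_tool is admissible when next_tool
    appears among the declared out modules."""
    return next_tool in out_modules
```

A process-design service could walk these declarations to suggest, or reject, candidate tool compositions.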
The <in_module> and <out_module> tags assume values in a predefined but extensible vocabulary of algorithm names, which are categorized according to a KDD taxonomy. Such a taxonomy is discussed in the next subsection.

<output>
  <out_path>
    <path>.\BVQ_model.dat</path>
  </out_path>
  name and path of the file containing the induced BVQ model
</output>

Fig. 5. Second section of the KDTML document related to the BVQ. An output file.

<link>
  <in_module>som</in_module>
  <out_module>bvqfm</out_module>
  <out_module>voronoi</out_module>
</link>

Fig. 6. Third section of the KDTML document related to the BVQ. Linkable modules.

D. Tool characterization by KDD taxonomy

In the final part of a KDTML document, the functionalities of a KDD tool are represented according to both a common vocabulary and a predefined taxonomy. Such a structure is an extension of the Data Mining taxonomy DAMON [5], which also covers the other KDD tasks, like data cleaning, data transformation, data selection, visualization and so on. A taxonomical structure is a natural way to characterize the KDD tools domain in terms of the implemented task. The goal of each KDD task can be achieved by various methods: e.g., for the classification task a user can use a Decision Tree, a Neural Network, a Fuzzy Set approach or a Genetic Algorithm. For each method a number of algorithms are available, e.g. C4.5, ID3 and CART are Decision Tree algorithms. Finally, any KDD tool is a specific implementation of a given algorithm. Figure 7 shows the KDTML fragment related to the taxonomical characterization of the BVQ train software: this tool implements the BVQ algorithm, which accomplishes the classification task by a particular vector quantization method.

<taxonomy>
  <task>Classification</task>
  <method>Vector Quantizer</method>
  <algorithm>Bayes Vector Quantizer</algorithm>
</taxonomy>

Fig. 7. Fourth section of the KDTML document related to the BVQ. The KDD taxonomy.

III. AN APPLICATION: THE AUTOMATIC WRAPPING SERVICE

As an application of KDTML, let us consider the problem of developing a wrapper that transforms a generic KDD tool into a web service. Wrapping legacy code is useful to integrate heterogeneous environments and make them interoperable. In particular, the application we discuss in the following is part of a project for the development of an extensible service oriented KDD support system [3]. To wrap a tool means to build a specific piece of software encapsulating the tool, which can then be viewed and exploited as a service. In order to achieve such a goal, a user needs technical knowledge about the Web Service architecture and the languages involved; however, a KDD user is often not a programming expert. Furthermore, developing the wrappers manually is impractical, because it is a time-consuming and error-prone task.
The KDTML markup language finds a useful application in the design of the Automatic Wrapping Service (AWS), which is devoted to the automatic creation of specific wrappers encapsulating legacy KDD tools. To generate a specific wrapper, the AWS starts from the KDTML information given by the provider, i.e. the software developer or anyone who wants to make a piece of software available. The AWS translates the KDTML document into a WSDL descriptor, which is made available to users for future requests. To support the user in the compilation of the KDTML document, we implemented another service, named XML-Info Generator Service (XIGS). XIGS asks the user a set of simple questions in order to compile the KDTML document, providing a user-friendly interface and avoiding editing errors. Using the output of the XIGS, the AWS generates the specific wrapper and the description of the service in WSDL. To fully translate the KDTML software description, the AWS uses an extended WSDL, which includes data mining specific information about the tool (e.g. name, method and technique implemented, version, authors) and about its I/O (e.g. name, default value, range of validity).

We developed a version of the AWS which is able to wrap KDD tools that are written in the most common imperative languages and that do not make use of graphical interfaces. This service is able to create wrappers managing any input structure and sequence, including optional sets of input data. Figure 8 shows the structure of the generated wrapper and its interaction with the user. The user, through any client interface, sends a SOAP request to the KDD service. The request is captured and decoded by the wrapper. The core component of the wrapper interacts with the tool, activating it and waiting for the results. The output is then encoded and sent back to the user in a SOAP response message.

Fig. 8. Interaction between a user and a wrapped KDD tool.
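The innermost step of such a wrapper, invoking the legacy executable with its call parameters and interactive input, and capturing its output, can be sketched as follows (in Python; the function, its signature and the use of subprocess are our illustration, while the real AWS additionally generates the SOAP/WSDL plumbing around this step):

```python
import subprocess

def run_legacy_tool(exe_path: str, params: list,
                    stdin_data: str = "") -> str:
    """Invoke a command-line KDD tool as a wrapper core would:
    pass the call parameters on the command line, feed any
    interactive input on STDIN, and capture STDOUT so it can be
    encoded into the SOAP response."""
    result = subprocess.run(
        [exe_path, *params],
        input=stdin_data,      # interactive inputs, if any
        capture_output=True,   # collect STDOUT/STDERR
        text=True,             # work with strings, not bytes
        check=True,            # raise if the tool fails
    )
    return result.stdout
```

The KDTML I/O section tells the generated wrapper which values go on the command line, which go on STDIN (the hidden inputs), and which output files to report back.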
The complete WSDL description of the BVQ train tool obtained by the AWS is shown in Figure 9.
IV. CONCLUSIONS

In the paper we presented KDTML, a markup language for the description of KDD tools. The use of a markup language to describe KDD software can leverage the discovery of new tools, the understanding of their scope, the design of KDD processes by tool composition and the management of integration and interoperability issues. This work is part of the Knowledge Discovery in Databases Virtual Mart (KDDVM) project, a more general project for the development of an open and extensible environment where users can look for implementations, suggestions, evaluations and examples of use of tools implemented as services. In this framework, KDTML finds useful applications in the design of specific services, e.g. a broker service or services for the management of both the composition and the activation of KDD processes. In particular, in this work we described the Automatic Wrapping Service, which allows us to automatically transform a KDD tool, written in any imperative language, into a web service.

Fig. 9. The WSDL of the BVQ train service.

The AWS, the wrapped BVQ tool, as well as other KDD services, are available at the KDDVM project site http://babbage.diiga.univpm.it:8080/axis/.

REFERENCES

[1] Krishnaswamy, S., Zaslavsky, A. and Loke, S. W., Internet Delivery of Distributed Data Mining Services: Architectures, Issues and Prospects, in Architectural Issues of Web-enabled Electronic Business, V. Murthy and N. Shi, Eds. Idea Group Publishing, 2003, ch. 7, pp. 113-127.
[2] Kumar, A., Kantardzic, M., Ramaswamy, P. and Sadeghian, P., An Extensible Service Oriented Distributed Data Mining Framework, in Proc. IEEE/ACM Intl. Conf. on Machine Learning and Applications, Louisville, KY, USA, 16-18 Dec. 2004.
[3] Diamantini, C., Potena, D. and Panti, M., Developing an Open Knowledge Discovery Support System for a Network Environment, in Proc. of the IEEE International Symposium on Collaborative Technologies and Systems, Saint Louis, Missouri, USA, May 15-19 2005, pp. 274-281.
[4] Grossman, R. and Mazzucco, M., DataSpace: a data Web for the exploratory analysis and mining of data, IEEE Computing in Science and Engineering, vol. 4, no. 4, pp. 44-51, July-Aug. 2002.
[5] Cannataro, M., Knowledge Discovery and Ontology-based services on the Grid, in The First GGF Semantic Grid Workshop, Chicago, IL, USA, Oct. 2003.
[6] Kargupta, H., Joshi, A., Sivakumar, K. and Yesha, Y., Eds., Data Mining: Next Generation Challenges and Future Directions. AAAI/MIT Press, 2004.
[7] N. Ye, Ed., Handbook of Data Mining. Kluwer Academic Publishers, 2003.
[8] Grossman, R., Gu, Y., Hanley, D., Hong, X., Levera, J., Mazzucco, M., Lillethun, D., Mambretti, J. and Weinberger, J., Photonic Data Services: Integrating Data, Network and Path Services to Support Next Generation Data Mining Applications, in Data Mining: Next Generation Challenges and Future Directions, Kargupta, H., Joshi, A., Sivakumar, K. and Yesha, Y., Eds. AAAI/MIT Press, 2004, ch. 5, pp. 89-103.
[9] Grossman, R., Bailey, S., Ramu, A., Malhi, B., Hallstrom, P., Pulleyn, I. and Qin, X., The management and mining of multiple predictive models using the predictive modeling markup language, Information and Software Technology, pp. 589-595, 1999.
[10] Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R., Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[11] C. Diamantini and A. Spalvieri, Quantizing for Minimum Average Misclassification Risk, IEEE Trans. on Neural Networks, vol. 9, no. 1, pp. 174-182, Jan. 1998.