Advances in Intelligent and Soft Computing 85. Editor-in-Chief: J. Kacprzyk


2 Advances in Intelligent and Soft Computing Editor-in-Chief Prof. Janusz Kacprzyk Systems Research Institute Polish Academy of Sciences ul. Newelska Warsaw Poland kacprzyk@ibspan.waw.pl Further volumes of this series can be found on our homepage: springer.com Vol. 71. Y. Demazeau, F. Dignum, J.M. Corchado, J. Bajo, R. Corchuelo, E. Corchado, F. Fernández-Riverola, V.J. Julián, P. Pawlewski, A. Campbell (Eds.) Trends in Practical Applications of Agents and Multiagent Systems, 2010 ISBN Vol. 72. J.C. Augusto, J.M. Corchado, P. Novais, C. Analide (Eds.) Ambient Intelligence and Future Trends, 2010 ISBN Vol. 73. J.M. Corchado, P. Novais, C. Analide, J. Sedano (Eds.) Soft Computing Models in Industrial and Environmental Applications, 5th International Workshop (SOCO 2010), 2010 ISBN Vol. 74. M.P. Rocha, F.F. Riverola, H. Shatkay, J.M. Corchado (Eds.) Advances in Bioinformatics, 2010 ISBN Vol. 75. X.Z. Gao, A. Gaspar-Cunha, M. Köppen, G. Schaefer, and J. Wang (Eds.) Soft Computing in Industrial Applications, 2010 ISBN Vol. 76. T. Bastiaens, U. Baumöl, and B.J. Krämer (Eds.) On Collective Intelligence, 2010 ISBN Vol. 77. C. Borgelt, G. González-Rodríguez, W. Trutschnig, M.A. Lubiano, M.Á. Gil, P. Grzegorzewski, and O. Hryniewicz (Eds.) Combining Soft Computing and Statistical Methods in Data Analysis, 2010 ISBN Vol. 78. B.-Y. Cao, G.-J. Wang, S.-Z. Guo, and S.-L. Chen (Eds.) Fuzzy Information and Engineering 2010 ISBN Vol. 79. A.P. de Leon F. de Carvalho, S. Rodríguez-González, J.F. De Paz Santana, and J.M. Corchado Rodríguez (Eds.) Distributed Computing and Artificial Intelligence, 2010 ISBN Vol. 80. N.T. Nguyen, A. Zgrzywa, and A. Czyzewski (Eds.) Advances in Multimedia and Network Information System Technologies, 2010 ISBN Vol. 81. J. Düh, H. Hufnagl, E. Juritsch, R. Pfliegl, H.-K. Schimany, and Hans Schönegger (Eds.) Data and Mobility, 2010 ISBN Vol. 82. B.-Y. Cao, G.-J. Wang, S.-L. Chen, and S.-Z. Guo (Eds.) Quantitative Logic and Soft Computing 2010 ISBN Vol. 83. 
J. Angeles, B. Boulet, J.J. Clark, J. Kovecses, and K. Siddiqi (Eds.) Brain, Body and Machine, 2010 ISBN Vol. 84. Ryszard S. Choraś (Ed.) Image Processing and Communications Challenges 2 ISBN Vol. 85. Álvaro Herrero, Emilio Corchado, Carlos Redondo, and Ángel Alonso (Eds.) Computational Intelligence in Security for Information Systems 2010 ISBN

Álvaro Herrero, Emilio Corchado, Carlos Redondo, and Ángel Alonso (Eds.)

Computational Intelligence in Security for Information Systems 2010

Proceedings of the 3rd International Conference on Computational Intelligence in Security for Information Systems (CISIS 2010)

Editors

Dr. Álvaro Herrero
University of Burgos
Civil Engineering Department
Polytechnic School
Francisco de Vittoria s/n
Burgos
Spain

Emilio Corchado
University of Salamanca
Departamento de Informática y Automática
Facultad de Biología
Plaza de la Merced s/n
Salamanca
Spain
escorchado@usal.es

Carlos Redondo
Fundación Centro de Supercomputación de Castilla y León
León
Spain
carlos.redondo@fcsc.es

Ángel Alonso
Fundación Centro de Supercomputación de Castilla y León
León
Spain

ISBN e-isbn DOI / Advances in Intelligent and Soft Computing ISSN Library of Congress Control Number:

© 2010 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

springer.com

Preface

The 3rd International Conference on Computational Intelligence in Security for Information Systems (CISIS 2010) provided a broad and interdisciplinary forum to present the most recent developments in several very active scientific areas such as Machine Learning, Infrastructure Protection, Intelligent Methods in Energy and Transportation, Network Security, Biometry, Cryptography, High-performance and Grid Computing, and Industrial Perspective, among others. The global purpose of the CISIS series of conferences has been to form a broad and interdisciplinary meeting ground offering the opportunity to interact with the leading research teams and industries actively involved in the critical area of security, and to get a picture of the current solutions adopted in practical domains.

This volume of Advances in Intelligent and Soft Computing contains accepted papers presented at CISIS 2010, which was held in León, Spain, on November 11-12, 2010. CISIS 2010 received over 50 technical submissions. After a thorough peer-review process, the International Program Committee selected 25 papers, which are published in these conference proceedings. This allowed the Scientific Committee to verify the vital and crucial nature of the topics involved in the event, and resulted in an acceptance rate close to 50% of the originally submitted manuscripts.

The selection of papers was extremely rigorous in order to maintain the high quality of the conference, and we would like to thank the members of the Program Committee for their hard work in the reviewing process. This process is crucial to creating a conference of a high standard, and the CISIS conference would not exist without their help.

Our warmest and special thanks go to the Keynote Speakers: Prof. Ajith Abraham from MIR-Labs. EUROPE and Dr. Jorge Ramió Aguirre from Universidad Politécnica de Madrid (Spain).

Particular thanks go as well to the conference main sponsors, namely Junta de Castilla y León, Supercomputing Center of Castilla y León and University of León, and to the Technical Co-Sponsors: IEEE - SECCION ESPAÑA, IEEE Systems, Man and Cybernetics-Spanish Chapter, MIR-Labs, and the International Federation for Computational Logic, who jointly contributed in an active and constructive manner to the success of this initiative.

We wish to thank Prof. Dr. Janusz Kacprzyk (Editor-in-chief), Dr. Thomas Ditzinger (Senior Editor, Engineering/Applied Sciences) and Mr. Holger Schaepe at Springer for their help and collaboration in this demanding scientific publication project. We thank as well all the authors and participants for their great contributions that made this conference possible and all the hard work worthwhile.

November 2010

Álvaro Herrero
Emilio Corchado
Carlos Redondo
Ángel Alonso

7 Organization Honorary Chairs Antonio Silván Rodríguez Carolina Blasco Consejero de Fomento - Regional Goverment of Castilla y León (Spain) Director of Telecommunication - Regional Goverment of Castilla y León (Spain) General Chair Emilio Corchado University of Salamanca (Spain) Program Committee Chairs Álvaro Herrero Emilio Corchado Carlos Redondo Ángel Alonso University of Burgos (Spain) University of Salamanca (Spain) University of León / FCSCL (Spain) University of León (Spain) Members Dr. Alberto Peinado Domínguez University of Malaga (Spain) Dr. Álvaro Herrero University of Burgos (Spain) Dr. Amparo Fúster Sabater CSIC (Spain) Dr. Andre CPLF de Carvalho University of São Paulo (Brazil) Dr. Angel Grediaga Olivo University of Alicante (Spain) Dr. Antonino Santos de Riego University of La Coruña (Spain) Dr. Antonio J. Tomeu Hardasmal University of Cadiz (Spain) Dr. Araceli Queiruga Dios Univsersity of Salamanca (Spain) Dr. Belén Vaquerizo University of Burgos (Spain) Dr. Bruno Baruque University of Burgos (Spain) Dr. Carlos Pereira Universidade de Coimbra (Portugal) Dr. Constantino Malagón Luque University Antonio de Nebrija (Spain) Dr. Dario Forte University of Milano Crema (Italy) Dr. David García Rosado University of Castilla la Mancha (Spain) Dr. Davide Leoncini University of Genova (Italy) Dr. Debasis Giri Haldia Institute of Technology (India) Dr. Domingo Gómez Pérez University of Cantabria (Spain)

8 VIII Organization Dr. Emilio Corchado University of Salamanca (Spain) Dr. Enrico Appiani Elsag Datamat (Italy) Dr. Enrique González Jiménez Autonomous University of Madrid (Spain) Dr. Fernando Tricas García University of Zaragoza (Spain) Dr. Francisco Plaza University of Salamanca (Spain) Dr. Francisco Rodríguez CINVESTAV IPN (México) Henríquez Dr. Gabriel López Millán University of Murcia (Spain) Dr. Gerald Schaefer Loughborough University (UK) Dr. Gonzalo Alvarez Marañón CSIC (Spain) Dr. Hujun Yin University of Manchester (UK) Dr. Isaac Agudo Ruiz University of Malaga (Spain) Dr. Javier Areitio Bertolín University of Deusto (Spain) Dr. Javier Carbó Rubiera Carlos III of Madrid University (Spain) Dr. Joan-Josep Climent University of Alicante (Spain) Dr. José Antonio Montenegro University of Malaga (Spain) Montes Dr. José Esteban Saavedra López University of Oruro (Bolivia) Dr. José Francisco Martínez INAOE (Mexico) Dr. José Francisco Vicent Francés University of Alicante (Spain) Dr. José Luis Salazar Riaño University of Zaragoza (Spain) Dr. Juan Guillermo EAFIT University (Colombia) Lalinde-Pulido Dr. Juan José Ortega Daza University of Malaga (Spain) Dr. Juan Manuel Corchado University of Salamanca (Spain) Dr. Judith Redi University of Genova (Italy) Dr. Julio Cesar Hernandez Castro University of Portsmouth (UK) Dr. Leandro Tortosa Grau University of Alicante (Spain) Dr. Leticia Curiel University of Burgos (Spain) Dr. Luis Enrique Sánchez Crespo University of Castilla la Mancha (Spain) Dr. Luis Hernández Encinas CSIC (Spain) Dr. Manuel Angel Serrano Martín University of Castilla la Mancha (Spain) Dr. Manuel Graña University of Pais Vasco (Spain) Dr. Marcos Gestal Pose University of La Coruña (Spain) Dr. María Victoria López López Complutense de Madrid University (Spain) Dr. Paolo Gastaldo University of Genova (Italy) Dr. Petro Gopych V.N. Karazin Kharkiv National University (Ukraine) Dr. Rafael Corchuelo University of Sevilla (Spain) Dr. 
Rafael Martínez Gasca University of Sevilla (Spain) Dr. Ramón Rizo Aldeguer University of Alicante (Spain) Dr. Ricardo Llamosa-Villalba Industrial University of Santander (Colombia) Dr. Roberto Uribeetxeberria University of Mondragon (Spain) Dr. Rodolfo Zunino University of Genova (Italy)

9 Organization IX Dr. Rosanna Costaguta Dr. Santiago Martín Acurio Del Pino Dr. Seema Verma Dr. Sergi Robles Martínez Dr. Sergio Decherchi Dr. Sorin Stratulat Dr. Tzai-Der Wang Dr. Urko Zurutuza Ortega Dr. Wenjian Luo Mr. Alain Lamadrid Vallina Mr. Ángel Arroyo Mr. Angel Martín del Rey Mr. Antonio Zamora Gómez Mr. Benjamín Ramos Alvarez Mr. Carlos Marcelo Martínez Cagnazzo Mr. Carlos Munuera Gómez Mr. Daniel Sadornil Renedo Mr. Diego Avila Pesantez Mr. Edgar Martínez Moro Mr. Eduardo Carozo Blumsztein Mr. Enrique Daltabuit Mr. Fausto Montoya Vitini Mr. Federico García Crespí Mr. Fernando Piera Gómez Mr. Francisco José Navarro Ríos Mr. Francisco Valera Pintor Mr. Guillermo Morales-Luna Mr. Jesús Esteban Díaz Verdejo Mr. Joaquín García-Alfaro Mr. Jordi Herrera Joancomarti Mr. Jorge Eduardo Sznek Mr. Jorge López Hernández-Ardieta Mr. José Daniel Britos Mr. José de Jesús Angel Angel Mr. José Luis Ferrer Gomila Mr. José Luis Rivas López Mr. José Manuel Benitez National University of Santiago del Estero (Argentina) Católica del Ecuador Pontificial University (Ecuador) Banasthali University (India) Autonomous University of Barcelona (Spain) University of Genova (Italy) University Paul Verlaine-Metz (France) Cheng Shiu University (Taiwan) University of Mondragon (Spain) University of Science and Technology of China (China) Department of High Education (Cuba) University of Burgos (Spain) Univsersity of Salamanca (Spain) University of Alicante (Spain) Carlos III of Madrid University (Spain) University of La República (Uruguay) University of Valladolid (Spain) University of Cantabria (Spain) Higher Polytechnic School of Chimborazo (Ecuador) University of Valladolid (Spain) University of Montevideo (Uruguay) National Autonomous of México University (México) Institute of Applied Physics (CSIC) (Spain) Miguel Hernández University (Spain) Computer Technicians Association (ATI) (Spain) University of Granada (Spain) Carlos III of Madrid University (Spain) CINVESTAV (Mexico) 
University of Granada (Spain) Carleton University (Canada) Autonomous University of Barcelona (Spain) Nacional del Comahue University (Argentina) Carlos III of Madrid University (Spain) National University of Córboda (Argentina) CINVESTAV IPN (México) University of Islas Baleares (Spain) University of Vigo (Spain) University of Granada (Spain)

10 X Organization Mr. Juan Pedro Hecht University of Buenos Aires (Argentina) Mr. Juan Tapiador University of York (UK) Mr. Lorenzo M. Martínez Bravo University of Extremadura (Spain) Mr. Mario Farias -Elinos La Salle University (Mexico) Mr. Mario Gerardo University of Castilla la Mancha (Spain) Piattini Velthuis Mr. Nicolás C.A. San Pablo Catholic University (Peru) Antezana Abarca Mr. Paul Mantilla Católica del Ecuador Pontificial University (Ecuador) Mr. Pedro Pablo Pinacho University of Santiago de Chile (Chile) Davidson Mr. Peter Roberts UCINF University (Chile) Mr. Pino Caballero Gil University of La Laguna (Spain) Mr. Rafael Calzada Pradas Carlos III of Madrid University (Spain) Mr. Roberto Gómez Cárdenas ITESM (México) Mr. Salvador Alcaraz Carrasco Miguel Hernández University (Spain) Mr. Sergio Bravo Silva University of Bío Bío (Chile) Mr. Sergio Pozo Hidalgo University of Sevilla (Spain) Mr. Vincenzo Mendillo Central University of Venezuela (Venezuela) Mrs. Ana Isabel Carlos III of Madrid University (Spain) González-Tablas-Ferreres Mrs. Candelaria Hernández Goya University of La Laguna (Spain) Mrs. Cristina Alcaraz Tello University of Malaga (Spain) Mrs. Lídice Romero Amondaray Oriente University (Cuba) Mrs. Mariemma I. Yagüe del University of Málaga (Spain) Valle Mrs. Raquel Redondo University of Burgos (Spain) Mrs. Rosaura Palma Orozco CINVESTAV IPN (México) Prof. Angela I. Barbero Díez University of Valladolid (Spain) Prof. Antoni Bosch Pujol Autonomous University of Barcelona (Spain) Prof. Antonio Maña Gómez University of Malaga (Spain) Prof. César Hervás-Martínez University of Córdoba (Spain) Prof. Danilo Pástor Ramírez Politechnic High School of Chimborazo (Ecuador) Prof. Enrique De la Hoz University of Alcalá (Spain) de la Hoz Prof. Fabián Velásquez Clavijo University of los Llanos (Colombia) Prof. Francisco University of Córdoba (Spain) Fernández-Navarro Prof. Gabriel Díaz Orueta UNED (Spain) Prof. 
Gustavo Adolfo Isaza University of Caldas (Colombia) Echeverri Prof. Hugo Pagola University of Buenos Aires (Argentina) Prof. Ignacio Luengo Velasco Complutense de Madrid University (Spain)

11 Organization XI Prof. Javier Fernando Castaño University of los Llanos (Colombia) Forero Prof. Javier Sánchez-Monedero University of Córdoba (Spain) Prof. Jose L. Salieron Pablo Olavide University (España) Prof. José Luis Imaña Complutense de Madrid University (Spain) Prof. Juan C. Fernández University of Córdoba (Spain) Prof. Juan Jesús Barbarán Sánchez University of Granada (Spain) Prof. Juan Tena Ayuso University of Valladolid (Spain) Prof. Juha Karhunen Helsinki University of Technology (Finland) Prof. Luis Alberto Pazmiño Proaño Católica del Ecuador Pontificial University (Ecuador) Prof. Luis Eduardo Meléndez Tecnológico Comfenalco University Campis Foundation (Colombia) Prof. Mario Mastriani GIIT-ANSES (Argentina) Prof. Pedro A. Gutiérrez University of Córdoba (Spain) Prof. Ramón Torres Rojas Marta Abreu de Las Villas Central University (Cuba) Prof. Reinaldo Nicolás Mayol University of Los Andes (Venezuela) Arnao Prof. Ricardo Contreras Arriagada University of Concepción (Chile) Prof. Richard Duro University of La Coruña (Spain) Prof. Rodrigo Adolfo Cofré LoyolaCatólica del Maule University (Chile) Organizing Committee Chairs Carlos Redondo Luis Muñóz Ángel Alonso University of León / FCSCL (Spain) FCSCL (Spain) University of León (Spain) Members Dr. Carlos Redondo Gil University of León (Spain) Dr. Manuel Castejón Limas University of León (Spain) Dr. Javier Alfonso Cendón University of León (Spain) Dr. Héctor Alaiz Moretón University of León (Spain) Dr. María del Carmen Benavides University of León (Spain) Cuéllar Dr. Isaías García Rodríguez University of León (Spain) Francisco Jesús Rodríguez Sedano University of León (Spain) Inmaculada González Alonso University of León (Spain) Ana María Díez Suárez University of León (Spain) Luis Angel Esquibel Tomillo University of León (Spain) Marcos Álvarez Díez University of León (Spain)

Dr. Emilio Corchado      University of Salamanca (Spain)
Ángel Arroyo             University of Burgos (Spain)
Dr. Bruno Baruque        University of Burgos (Spain)
Dr. Álvaro Herrero       University of Burgos (Spain)
Álvaro Alonso            University of Burgos (Spain)
Santiago Porras          University of Burgos (Spain)
Ruth Alonso              FCSCL (Spain)
Álvaro Fernandez         FCSCL (Spain)

Contents

Chapter 1: Machine Learning and Intelligence

An Incremental Density-Based Clustering Technique for Large Datasets
    Saif ur Rehman, Muhammed Naeem Ahmed Khan
BSDT ROC and Cognitive Learning Hypothesis
    Petro Gopych, Ivan Gopych
Evolving Fuzzy Classifier for Data Mining - An Information Retrieval Approach
    Pavel Krömer, Václav Snášel, Jan Platoš, Ajith Abraham
Mereotopological Analysis of Formal Concepts in Security Ontologies
    Gonzalo A. Aranda-Corral, Joaquín Borrego-Díaz

Chapter 2: Agents and Multi-Agent Systems

A Multi-agent Data Mining System for Defect Forecasting in a Decentralized Manufacturing Environment
    Javier Alfonso Cendón, Ana González Marcos, Manuel Castejón Limas, Joaquín Ordieres Meré
A Distributed Hierarchical Multi-agent Architecture for Detecting Injections in SQL Queries
    Cristian Pinzón, Juan F. De Paz, Álvaro Herrero, Emilio Corchado, Javier Bajo
Incorporating Temporal Constraints in the Analysis Task of a Hybrid Intelligent IDS
    Martí Navarro, Emilio Corchado, Vicente Julián, Álvaro Herrero

Chapter 3: Image, Video and Speech Processing

Performances of Speech Signal Biometric Systems Based on Signal to Noise Ratio Degradation
    Dzati Athiar Ramli, Salina Abdul Samad, Aini Hussain
Lipreading Using n-Gram Feature Vector
    Preety Singh, Vijay Laxmi, Deepika Gupta, M.S. Gaur
Face Processing for Security: A Short Review
    Ion Marqués, Manuel Graña

Chapter 4: Network Security

Ontologies-Based Automated Intrusion Response System
    Verónica Mateos Lanchas, Víctor A. Villagrá González, Francisco Romero Bueno
Semi-supervised Fingerprinting of Protocol Messages
    Jérôme François, Humberto Abdelnur, Radu State, Olivier Festor
Monitoring of Spatial-Aggregated IP-Flow Records
    Cynthia Wagner, Gerard Wagener, Radu State, Thomas Engel
Improving Network Security through Traffic Log Anomaly Detection Using Time Series Analysis
    Aitor Corchero Rodriguez, Mario Reyes de los Mozos
A Threat Model Approach to Threats and Vulnerabilities in On-line Social Networks
    Carlos Laorden, Borja Sanz, Gonzalo Alvarez, Pablo G. Bringas
An SLA-Based Approach for Network Anomaly Detection
    Yasser Yasami
Understanding Honeypot Data by an Unsupervised Neural Visualization
    Álvaro Alonso, Santiago Porras, Enaitz Ezpeleta, Ekhiotz Vergara, Ignacio Arenaza, Roberto Uribeetxeberria, Urko Zurutuza, Álvaro Herrero, Emilio Corchado

Chapter 5: Watermarking

Permuted Image DCT Watermarking
    Reena Gunjan, Saurabh Maheshwari, M.S. Gaur, Vijay Laxmi
A Developed Watermark Technique for Distributed Database Security
    Hazem M. El-Bakry, Mohamed Hamada

Chapter 6: Cryptography

Trident, a New Pseudo Random Number Generator Based on Coupled Chaotic Maps
    Amalia Beatriz Orúe López, Gonzalo Álvarez Marañon, Alberto Guerra Estévez, Gerardo Pastor Dégano, Miguel Romera García, Fausto Montoya Vitini
The Impact of the SHA-3 Casting Cryptography Competition on the Spanish IT Market
    Manuel J. Martínez, Roberto Uribeetxeberria, Urko Zurutuza, Miguel Fernández

Chapter 7: Industrial and Commercial Applications of Intelligent Methods for Security

A New Task Engineering Approach for Workflow Access Control
    Hanan El Bakkali, Hamid Hatim, Ilham Berrada
OPBUS: Fault Tolerance Against Integrity Attacks in Business Processes
    Angel Jesus Varela Vaca, Rafael Martínez Gasca
A Key Distribution Scheme for Live Streaming Multi-tree Overlays
    Juan Álvaro Muñoz Naranjo, Juan Antonio López Ramos, Leocadio González Casado
Intelligent Methods for Scheduling in Transportation
    Mª Belén Vaquerizo García

Author Index

Chapter 1 Machine Learning and Intelligence

An Incremental Density-Based Clustering Technique for Large Datasets

Saif ur Rehman and Muhammed Naeem Ahmed Khan

Abstract. Data mining, also known as knowledge discovery in databases, is a statistical analysis technique used to find hidden patterns and identify untapped value in large datasets. Clustering is a principal data discovery technique in data mining that segregates a dataset into subsets, or clusters, so that data values in the same cluster share common characteristics or attributes. Many researchers have proposed clustering techniques that can identify arbitrarily shaped clusters, where a cluster is defined as a dense region separated by low-density regions; among them, DBSCAN is a prime density-based clustering algorithm. DBSCAN is capable of discovering clusters of arbitrary shape and size in databases that contain noise and outliers. Many researchers have attempted to overcome certain deficiencies of the original DBSCAN, such as its difficulty in identifying patterns within datasets of varied densities and its high computational complexity; hence a number of augmented forms of the DBSCAN algorithm are available. We present an incremental density-based clustering technique, based on the fundamental DBSCAN clustering algorithm, that aims to reduce its computational complexity. The proposed algorithm can be used in different knowledge domains such as image processing, classification of patterns in GIS maps, X-ray crystallography and information security.

Keywords: Clustering Techniques, DBSCAN, Data Mining, Statistical Analysis, Knowledge Discovery in Databases.

1 Introduction

Data mining is the process of extracting hidden, interesting and distinctive patterns and affinities from large datasets. The extracted patterns, rules and relationships serve as a valuable tool in the process of decision-making and future prediction.
Saif ur Rehman and Muhammed Naeem Ahmed Khan
Department of Computer Science, SZABIST, Islamabad, Pakistan
e-mail: {saifi.ur.rehman,mnak2010}@gmail.com

Á. Herrero et al. (Eds.): CISIS 2010, AISC 85, pp
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

To make use of the extracted information, efficient and effective analysis methods are imperative. One such method is clustering, in which a dataset of objects is divided into several clusters such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized [2]. In the past, a large number of clustering algorithms have been proposed. Clustering techniques are categorized into partitioning, hierarchical, grid-based, density-based and model-based. Under the partitioning category, the foremost techniques include PAM [15], CLARA [15] and CLARANS [3]. The well-known algorithms of the hierarchical category are CURE [4] and CHAMELEON [5]. The grid-based clustering techniques include CLIQUE [6], ENCLUS [7] and WaveCluster [8]. Among the density-based clustering techniques, DBSCAN [9], DENCLUE [10] and OPTICS [11] are generally popular. Examples of the model-based clustering techniques are COBWEB [12] and SOM [13]. Algorithms proposed under each of these categories strive to discover proximities of the data objects on the basis of certain characteristics of one or more attributes.

The density-based clustering algorithms typically form clusters as dense regions of points in the data space that are separated by regions of low density. DBSCAN [9] is the first and leading density-based clustering technique. DBSCAN forms clusters with respect to a density-based connectivity analysis.

In this paper, we propose a new density-based clustering technique that extends the original DBSCAN algorithm. The idea is to look for clusters in the dataset in an incremental fashion: starting from an arbitrary data point, the algorithm keeps exploring the other data points in its close proximity to form clusters. We have endeavored to improve the time complexity of the DBSCAN algorithm and to overcome its key problem of dependency on the user to supply threshold values. For a given sorted dataset of objects, our algorithm first calculates the density thresholds and then explores the similarity of the dataset objects to form clusters. The proposed technique has been evaluated using a two-dimensional dataset, on which it exhibited faster cluster identification and improved efficiency.

This paper consists of seven sections. This section gives a brief introduction to clustering techniques. A simplified description of the DBSCAN algorithm is provided in Section 2. Section 3 highlights different DBSCAN variations proposed by different researchers as augmentations of the original DBSCAN algorithm. The model for our proposed incremental DBSCAN method is described in Section 4. Section 5 highlights the computational complexity of our proposed algorithm. Experimental details are discussed in Section 6, and the conclusion along with future directions in Section 7.

2 What Is DBSCAN?

DBSCAN is the foremost density-based clustering algorithm. It was proposed by Ester et al. [9] in 1996 with the key objective of clustering data points of arbitrary shapes in the presence of noise in spatial and non-spatial high-dimensional databases. The key idea of DBSCAN is that, for each object of a cluster, the neighborhood of a given radius (named Eps) should contain at least a minimum number of objects (MinPts); that is, the cardinality of the neighborhood needs to satisfy or exceed a certain threshold. However, the threshold is not a fixed value; rather, it is defined purely by the user. Hence, the central concepts in DBSCAN are the ε-neighborhood and MinPts. The ε-neighborhood of an arbitrary data point P is defined as:

$$N_{Eps}(P) = \{\, q \in D \mid \mathrm{dist}(P, q) \le Eps \,\} \qquad (1)$$

where D is the database of objects. If the ε-neighborhood of a data point P contains at least a minimal number of points, then that point is called a core point. Therefore, a core point is defined by:

$$|N_{Eps}(P)| \ge MinPts \qquad (2)$$

DBSCAN searches for clusters by checking the ε-neighborhood of each data point or object in the dataset. If the ε-neighborhood of a data point P contains at least MinPts data objects, then a new cluster with P as its core point is formed. The algorithm then iteratively collects directly density-reachable data points from the core points, which may require merging the new density-reachable data points into a previously created cluster. This process terminates when no new data object can be added to any cluster [6].

3 Related Work

Liu et al. [14] proposed the VDBSCAN algorithm, a modified version of the DBSCAN algorithm. In VDBSCAN, the authors devised a strategy to make the existing density-based algorithm more efficient and scalable by extending the original DBSCAN algorithm to analyze datasets having varied densities. The density threshold values are calculated for different dataset densities according to a k-dist plot. The clustering algorithm is then applied using the calculated values of the Eps parameter. In 2004, El-Sonbaty et al. [2] provided an enhanced version of DBSCAN. In the preprocessing step, the dataset being analyzed is partitioned using the CLARANS [3] technique.
By virtue of this partitioning of the dataset, the search effort needed to locate the core objects is minimized. The major achievements of their study are: (i) it takes less time to cluster the dataset, because the search space is limited to the objects of one partition rather than the whole dataset; and (ii) as the dataset is partitioned into smaller object sets, a smaller buffer is required for holding the partitioned dataset. The FDBSCAN [16] algorithm was introduced by Bing Liu to overcome some of the DBSCAN limitations, including: (i) its slow speed (the neighborhood query is slowed down by the comparisons involved for each object); and (ii) the setting of the threshold value. It is a time-efficient algorithm, as it decreases the computational time by ignoring the region objects that are already clustered. Another enhancement of the DBSCAN algorithm is provided by Ram et al. [17], who named their algorithm EDBSCAN. Their investigation identifies a key problem of DBSCAN, namely that it does not handle local density variation within a cluster, and proposes that a significant density variation needs to be allowed within the cluster to enhance its performance. EDBSCAN uses two user-specified parameters, δ and τ, which specify the cutoff points that limit the amount of local density variation allowed within the cluster.

4 Incremental DBSCAN Algorithm

In this section, we present a new enhancement to the fundamental clustering algorithm DBSCAN and highlight the different processes involved in enhancing DBSCAN. Further, we evaluate the performance of our incremental DBSCAN using a two-dimensional dataset used in [17]. Our algorithm is based on DBSCAN [17] and enhances it in three different ways. In the first step, the dataset is sorted. In the next step, the region query on the dataset is performed in order to locate those points that will be included in clusters. In the final step, the clusters resulting from the region query are merged. This merging produces the final clustering results. The algorithm uses a special variable, the seed skip count (seedskipcount), to skip those points that do not have enough neighbors to form a cluster. A detailed description of these steps is given below.

4.1 Dataset Sorting

In the proposed algorithm, the dataset is arranged in sorted order, either ascending or descending, and all the data points are initially flagged as noise (meaning that these data points do not belong to any cluster). During the execution of the algorithm, the flag of those data points that constitute a cluster is changed. Hence, at the end of the algorithm execution, the points that have not been used in forming clusters are automatically singled out as noise.

4.2 Region Query

Once the dataset is organized in some order, we perform the region query starting from the point at the origin, i.e., the point whose x and y coordinates both have index zero.
In this process, for a given Eps value, our algorithm checks the neighborhood of the data point currently being analyzed in horizontal, vertical and diagonal directions. The neighbors of a data point consists of all the data points that fall within the specified Eps distance value in the above mentioned directions. For a discrete dataset that constitutes a XY-plane, our algorithm can possibly inspect up to eight data points in the neighborhood of a current data point for a unit value of Eps. Table 1 illustrates the criteria used to establish the neighborhood of a data point based on x- and y-coordinates. The neighborhood criteria mentioned in the above table can graphically be depicted as shown in Fig. 1. In the neighborhood checking process, all those data points that are at the Eps distance and pass the MinPts criteria are merged into the

cluster and are assigned the corresponding clusterid. If the data points inspected by the region query do not satisfy the MinPts criterion, the seedskipcount variable is incremented. When the value of seedskipcount reaches two or more, a new clusterid is generated, the next available data point in the dataset labeled NOISE is chosen, and the region query is carried out on it to check for another possible cluster. This process is repeated until all the data points in the dataset have been checked.

Table 1 Criteria to establish the neighborhood of a data point with coordinates (X, Y)

Increment/Decrement Coordinates          New Point
Increment X and Y by 1                   (X+1, Y+1)
Increment X by 1 and decrement Y by 1    (X+1, Y-1)
Increment X by 1                         (X+1, Y)
Increment Y by 1                         (X, Y+1)
Decrement X by 1 and increment Y by 1    (X-1, Y+1)
Decrement X and Y by 1                   (X-1, Y-1)
Decrement Y by 1                         (X, Y-1)
Decrement X by 1                         (X-1, Y)

Fig. 1 Neighborhood criteria to check the core point

4.3 Cluster Merging

Merging of clusters is the final stage of our proposed algorithm. In the merging process, two or more adjoining clusters with identical characteristics are merged together to form a bigger cluster. Merging is needed because the MinPts criterion can cause new small clusters to be formed. The MinPts criterion requires the existence of a specific number of data points in

the close proximity of a core data point. In certain cases, some data points may exist in the neighborhood of a core point, but their number is less than the required MinPts. In such a situation, the algorithm will form a new cluster by skipping the seed points. This scenario leads to a new notion: although a core point has fewer neighbors than the required MinPts, the neighbors of one of the points returned by its region query may satisfy the MinPts criterion. In such a situation, the adjacent clusters need to be merged. An abridged version of our Incremental DBSCAN is provided below.

function Incremental_DBSCAN(SetOfObjects, Eps, MinPts)
    // call the sort method to sort the dataset points;
    // SetOfObjects is a multidimensional array of data points
    SetOfObjects = Sort_Objects(SetOfObjects);
    // initialize each point with a NOISE flag
    SetOfObjects.flag = NOISE;
    // set the first ClusterID to 1
    ClusterID = 1;
    // variable counting the points skipped during checking
    SeedSkipCount = 0;
    // loop through the dataset points
    For loopvar = 1 to SetOfObjects.size
        // get the coordinates of the next point in the dataset
        (X, Y) = SetOfObjects.getNextPoint(loopvar);
        // verify that the point has not been clustered before
        If SetOfObjects(loopvar).flag = NOISE then
            seeds = RegionQuery(X, Y, Eps);
            // check the results returned from the RegionQuery method
            // for labeling the points with the current ClusterID
            If seeds.size >= MinPts then
                For Each seed In seeds Do
                    SetOfObjects(seed).flag = ClusterID;
                End For Each
                // assign the ClusterID to the current point
                // in the dataset at location loopvar
                SetOfObjects(loopvar).flag = ClusterID;
            Else
                SeedSkipCount++;
                If SeedSkipCount >= 2 then
                    ClusterID++;
                    SeedSkipCount = 0;
                End If
            End If
        End If
    End For  // loop ends here
    // call the merge method to merge the clusters with
    // matching characteristics
    Merge_Cluster();
End;  // end of main function
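The labeling loop of the pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the original Matlab implementation: it assumes integer grid coordinates and a unit Eps, so that the neighborhood is exactly the eight offsets of Table 1, and it omits the merging step. The names `NOISE`, `OFFSETS`, `region_query` and `label_points` are hypothetical.

```python
NOISE = -1  # assumed sentinel for points that belong to no cluster

# The eight unit offsets listed in Table 1 (horizontal, vertical, diagonal).
OFFSETS = [(1, 1), (1, -1), (1, 0), (0, 1), (-1, 1), (-1, -1), (0, -1), (-1, 0)]


def region_query(point, present):
    """Return the dataset points among the eight Table 1 neighbors of `point`."""
    x, y = point
    return [(x + dx, y + dy) for dx, dy in OFFSETS if (x + dx, y + dy) in present]


def label_points(points, min_pts):
    """Single pass over the sorted points, following the pseudocode loop."""
    ordered = sorted(points)
    present = set(ordered)          # constant-time membership test
    flags = {p: NOISE for p in ordered}
    cluster_id = 1
    seed_skip_count = 0
    for p in ordered:
        if flags[p] != NOISE:
            continue                # point was clustered earlier
        seeds = region_query(p, present)
        if len(seeds) >= min_pts:
            for q in seeds:
                flags[q] = cluster_id
            flags[p] = cluster_id
        else:
            seed_skip_count += 1
            if seed_skip_count >= 2:
                cluster_id += 1     # start a new cluster id
                seed_skip_count = 0
    return flags


# A 3x3 block of points plus one isolated point.
grid = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
flags = label_points(grid, min_pts=3)
```

In this small example every point of the 3x3 block receives cluster id 1, while the isolated point (10, 10) keeps the `NOISE` flag because its region query returns no neighbors.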

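The merging step is only named (Merge_Cluster) and not spelled out in the pseudocode, so the following Python sketch shows one possible reading rather than the authors' procedure. It assumes that two clusters are "adjoining" when some point of one is a Table 1 neighbor of some point of the other, and it joins such clusters with a small union-find structure. The function name `merge_adjoining`, the `NOISE` sentinel and the rule of keeping the smaller cluster id are assumptions of this sketch.

```python
NOISE = -1  # assumed sentinel for points that belong to no cluster
OFFSETS = [(1, 1), (1, -1), (1, 0), (0, 1), (-1, 1), (-1, -1), (0, -1), (-1, 0)]


def merge_adjoining(flags):
    """Relabel clusters so that clusters with neighboring points share one id.

    `flags` maps (x, y) to a cluster id or NOISE. Noise points are left as is.
    """
    parent = {cid: cid for cid in set(flags.values()) if cid != NOISE}

    def find(cid):
        # Follow parent links to the representative id (with path halving).
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for (x, y), cid in flags.items():
        if cid == NOISE:
            continue
        for dx, dy in OFFSETS:
            other = flags.get((x + dx, y + dy), NOISE)
            if other == NOISE:
                continue
            a, b = find(cid), find(other)
            if a != b:
                parent[max(a, b)] = min(a, b)  # keep the smaller id

    return {p: (NOISE if cid == NOISE else find(cid)) for p, cid in flags.items()}


flags = {(0, 0): 1, (0, 1): 1, (1, 1): 2, (1, 2): 2, (5, 5): 3, (9, 9): NOISE}
merged = merge_adjoining(flags)
```

Here clusters 1 and 2 touch at (0, 1) and (1, 1) and are therefore joined under id 1, whereas cluster 3 and the noise point are unchanged.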
5 Computational Complexity

The computational complexity of our proposed algorithm mainly depends on the number of iterations of its core loop and on the repeated calls to the function RegionQuery(). The core loop is executed only once and runs in linear time. Within the core loop, RegionQuery() is called a maximum of eight times to check the neighborhood of a point, so its execution time can be considered constant. Hence, the computational complexity of the core of our algorithm is linear, i.e., O(n). However, our proposed algorithm requires sorting the dataset first. If the dataset is sorted using merge sort, whose computational complexity is O(n log n), then the overall complexity of our proposed algorithm is O(n) + O(n log n), which simplifies to O(n log n). Hence, the overall computational complexity of our incremental DBSCAN algorithm is O(n log n). Moreover, the experimental results described in Section 6 show that the computational efficiency of our algorithm is much better than that of the fundamental DBSCAN algorithm.

6 Experimental Details

The two-dimensional dataset used in [17] is taken as test data to evaluate the performance of our proposed algorithm. We implemented our proposed algorithm through m-file programming in Matlab version 7.7. The experiment was carried out on a Pentium system with 2 GB RAM and a 3.0 GHz processor. The test dataset was two-dimensional and consisted of 1000 data objects. The dataset was first sorted in ascending order with respect to the x-coordinate values. The sorted dataset was then plotted in Matlab, as shown in Fig. 2. Fig. 3 depicts the clusters identified by our proposed algorithm. In this experiment, we set Eps = 0.1 and MinPts = 5.
As evident from Fig. 3, our algorithm successfully discovered four clusters in the dataset. However, by changing the values of the Eps and MinPts parameters, different numbers and sizes of clusters can be obtained.

Fig. 2 Plain drawing of 1000 data objects

Fig. 3 Marking of clusters in the dataset using our algorithm

7 Conclusion and Future Work

In this paper, we have presented a new density-based clustering algorithm. This algorithm is an attempt to improve the clustering results of the basic DBSCAN clustering algorithm. To enhance the efficiency of the cluster identification process, we perform a region query for each data point that is marked as a member of the cluster. At the end, clusters are merged to obtain the final clustering results. The experimental results demonstrate that our proposed clustering technique is promising, as it not only identifies the correct number of clusters in a dataset but also has a lower computational complexity than the fundamental DBSCAN algorithm. Further improvements to our proposed algorithm will be part of future research. Future prospects include analyzing the clustering results using datasets of different sizes with varying values of MinPts, and verifying the efficiency of the algorithm on datasets of disparate nature and dimensions. Moreover, modifying our incremental density-based algorithm to handle continuous datasets is also potential future work.

References

1. Fahim, A.M., Salem, A.M., Torkey, F.A., Ramadan, M.A.: Density Clustering Based on Radius of Data (DCBRD). World Academy of Science, Engineering and Technology (2006)
2. El-Sonbaty, Y., Ismail, M.A., Farouk, M.: An Efficient Density Based Clustering Algorithm for Large Databases. In: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (2004)