GE-INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH, VOLUME 3, ISSUE 6 (June 2015)
EMERGING CLUSTERING TECHNIQUES ON BIG DATA

Pooja Batra Nagpal 1, Sarika Chaudhary 2, Preetishree Patnaik 3
1,2,3 Computer Science, Amity University, India

ABSTRACT
The term "Big Data" is defined as enormous data sets with a large, diverse, and complex structure that creates difficulty in storing, analyzing, searching, and visualizing them. The process of uncovering hidden correlation patterns in such massive data sets is called "Big Data Mining"; it applies the same idea of discovering hidden and relevant data through the principles of Data Mining. Clustering algorithms have emerged as a powerful alternative meta-learning tool that helps to analyze the massive volume of data (Big Data) generated by many applications. In general, we found that categorizing Big Data causes considerable confusion. Therefore, the most relevant clustering algorithm must be used to classify Big Data. In this paper we explain the challenges that big data faces today and survey the different clustering techniques used in Big Data analytics.

KEYWORDS - BIG DATA, CLUSTERING, DATA MINING, CLUSTERING ALGORITHMS.

1. INTRODUCTION
The term Big Data encompasses all forms of data, including Web logs, data from social networking sites, sensor data, tweets, blogs, user reviews, and SMS messages. Big data and big data analytics are recent areas of study in information technology and business intelligence. These data are generated from social networking sites such as Facebook and Twitter, online transactions, emails, videos, audio, images, click streams, logs, posts, search queries, health records, science data, sensors, and smartphones and their applications [1]. These data come in different formats, so they are difficult to store, analyze, and visualize with typical database software tools.
Compared with past decades, the IT industry has changed a lot. With faster transactions, people are accessing huge amounts of data in various forms, e.g. Internet mail, video, images, audio messages, and sensor data streams. Such wide accessibility of data has brought a revolutionary change in the analysis of data stream patterns. Thus data scientists have announced that we are now in the era of Big Data, or that we are sinking deeper into the water of big data every day.

Today, we find ourselves in an era of digitalization, with gigantic progress in technologies, web media, social networking sites, online services over the Internet, smartphones, etc., in which every user accesses enormous quantities of data from various data sources. Such enormous data sets, with their massive, diverse, and complex structure, are termed Big Data. These massive data create many difficulties in storing, analyzing, searching, and visualization. This massive volume of data can be useful to users in various respects, but it also creates a lot of confusion in storage and analysis. Therefore, massive data sets (Big Data) need to be stored in an effective and efficient manner that supports various types of operations (i.e. analytical operations, process operations, retrieval, reliability of data, etc.). Thus it is most important to organize these massive data sets into hidden correlation/pattern/cluster models, which make them easier to use, through the application of various clustering techniques and Data Mining methods.

2. CATEGORIZATION OF BIG DATA
Although the huge volume of data (Big Data) can be genuinely useful for users, it also creates many problems in storing and analyzing. A big volume of data therefore has its own deficiencies as well. It needs large storage, and this volume makes operations such as analytical, process, and retrieval operations very difficult and hugely time consuming. One way to overcome these problems is to have big data clustered in a compact format that is still an informative version of the entire data. Such clustering techniques aim to produce good-quality clusters/summaries.
Therefore, they would hugely benefit everyone, from ordinary users to researchers and people in the corporate world, as they could provide an efficient tool for dealing with large data, for example in critical systems (to detect cyber attacks). The figure below depicts the categorization of Big Data. [1][2]

2.1 VARIETY
Big data come from a great range of sources, and the data are categorized into three types: structured, semi-structured, and unstructured.

STRUCTURED DATA
Structured data are organized and easily sorted for storage in a database. This variety of data includes abstract data types, web links, pointers, etc.

UNSTRUCTURED DATA
Unstructured data are random and difficult to analyze. They are heterogeneous and raw/incomplete data generated by multiple users from different sources (e.g. bitmap images, objects, text, etc.).

2.1.2 SEMI-STRUCTURED DATA
These are a combination of structured and unstructured data and do not conform to a fixed set of tags or other semantic structure. [4]

2.2 VOLUME
The volume, or size, of the data is now larger than terabytes and petabytes. The grand scale and rise of data outstrips traditional storage and analysis techniques. Because Big Data is massive in size, designing a large database for its effective storage and visualization is one of the biggest challenges for data scientists. [1][4]

2.3 VELOCITY
Data are generated and used at a very high rate. Velocity is a necessary parameter not only for big data but for all processes. For time-limited processes, big data should be used as it streams into the organization in order to maximize its value. [1][4]

2.4 VERACITY
These data are generally uncertain due to inconsistency, ambiguity, and latency.

FIGURE 2. THE FOUR V'S OF BIG DATA [1][2].

3. TAXONOMY OF CLUSTERING
The term clustering, or cluster analysis, was first coined by Driver and Kroeber, and it is well known as an unsupervised learning method in Data Mining. Different scientists have since developed different types of clustering algorithms that vary in their properties, clustering models, etc. In general, clustering can be defined as the process of grouping a set of objects into classes of similar objects, or equivalently as the division of data into groups of similar objects. The shape and size of the clusters formed, and their visualization, vary from one algorithm to another according to the properties of the algorithm. Despite the huge number of surveys of clustering algorithms available for various domains (i.e. machine learning, information retrieval, pattern recognition, bioinformatics, and semantic medical sciences), it is difficult for users to decide which algorithm is appropriate for analyzing massive data sets. Therefore we present a taxonomy of clustering algorithms and propose this classification as a framework that covers the major factors in selecting suitable algorithms for massive data sets. The clustering algorithms are broadly classified into four categories, which are as follows.

PARTITIONING-BASED CLUSTERING
In this type of algorithm, all clusters are determined promptly. Initial groups are specified and reallocated towards a union. In other words, partitioning algorithms divide the data objects into a number of partitions, where each partition represents a cluster. These clusters should fulfill the following requirements: each group must contain at least one object, and each object must belong to exactly one group. There are many partitioning algorithms, such as K-modes, PAM, CLARA, CLARANS, and FCM.

HIERARCHICAL-BASED CLUSTERING
This type of clustering method is also known as connectivity-based clustering. Data are organized in a hierarchical manner depending on the medium of proximity. Proximities are obtained by the intermediate nodes. A dendrogram (from the Greek word for tree) represents the dataset, with individual data items presented by leaf nodes. The initial cluster gradually divides into several clusters as the hierarchy continues. Hierarchical clustering methods are of two types: a) agglomerative (bottom-up) and b) divisive (top-down). An agglomerative clustering starts with one object for each cluster and recursively merges two or more of the most appropriate clusters.
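As a rough illustration of the agglomerative (bottom-up) strategy just described, the following Python sketch starts with one object per cluster and repeatedly merges the two closest clusters. It is a minimal sketch under stated assumptions, not the procedure of any particular algorithm surveyed here: the single-linkage distance, the Euclidean metric, the stopping rule (a requested number k of clusters), and the function names are all choices made for the example.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points (an assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # start with one object per cluster
    while len(clusters) > k:
        # pick the pair of clusters with the smallest linkage distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, k=3)
```

On the five sample points the two nearby pairs end up together and the isolated point (10, 0) stays in its own cluster. This naive version recomputes all pairwise linkages at every merge, so it is only meant to show the idea on small inputs.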
A divisive clustering starts with the dataset as one cluster and recursively splits the most appropriate cluster. The process continues until a stopping criterion is reached (frequently, the requested number k of clusters).

DENSITY-BASED CLUSTERING
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that separate clusters are usually considered to be noise and border points. Here, data objects are separated based on their regions of density, connectivity, and boundary. They are closely related to point-nearest neighbors. A cluster, defined as a connected dense component, grows in any direction that density leads to. Therefore, density-based algorithms are capable of discovering clusters of arbitrary shapes. This also provides a natural protection against outliers. Thus the overall density of a point is analyzed to determine the functions of datasets that influence a particular data point. DBSCAN, OPTICS, DBCLASD, and DENCLUE are algorithms that use such a method to filter out noise and discover clusters of arbitrary shape.

GRID-BASED CLUSTERING
The space of the data objects is divided into grids (cells). The main advantage of this approach is its fast processing time, because it goes through the dataset once to compute the statistical values for the grids. The accumulated grid data make grid-based clustering techniques independent of the number of data objects: they employ a uniform grid to collect regional statistical data and then perform the clustering on the grid instead of on the database directly. The performance of a grid-based method depends on the size of the grid, which is usually much smaller than the size of the database. However, for highly irregular data distributions, using a single uniform grid may not be sufficient to obtain the required clustering quality or to fulfill the time requirement. Wave-Cluster and STING are typical examples of this category.

In the following section, we explain in detail the criteria of clustering methods that correspond to each property of big data [13].

4. SELECTION CRITERIA

4.1 TYPES OF DATASET
The majority of traditional clustering algorithms are designed to focus either on numeric data or on categorical data. Data collected in the real world, however, often contain both numeric and categorical attributes, and it is difficult to apply traditional clustering algorithms directly to these kinds of data.
Clustering algorithms work effectively either on purely numeric data or on purely categorical data; most of them perform poorly on mixed categorical and numerical data types.

SIZE OF DATASET
The size of the dataset has a major effect on the clustering quality. Some clustering methods are more efficient than others when the data size is small, and vice versa.

INPUT PARAMETER
A desirable feature for practical clustering is to have fewer parameters, since a large number of parameters may affect cluster quality, which will then depend on the values of the parameters.

HANDLING OUTLIERS/NOISY DATA

A successful algorithm will often be able to handle outliers/noisy data, because the data in most real applications are not pure. Noise also makes it difficult for an algorithm to cluster an object into a suitable cluster, and this affects the results provided by the algorithm.

TIME COMPLEXITY
Most clustering methods must be run several times to improve the clustering quality. Therefore, if the process takes too long, it can become impractical for applications that handle big data.

STABILITY
One of the important features of any clustering algorithm is the ability to generate the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

HANDLING HIGH DIMENSIONALITY
This is a particularly important feature in cluster analysis, because many applications require the analysis of objects containing a large number of features (dimensions). For example, text documents may contain thousands of terms or keywords as features. This is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low.

CLUSTER SHAPE
A good clustering algorithm should be able to handle real data and their wide variety of data types, which will produce clusters of arbitrary shape.

5. CLUSTERING ALGORITHMS
In this section we discuss each of the selected algorithms in detail, with its pseudo-code and a survey analysis of its strengths and weaknesses.

5.1 BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a data clustering method that is an example of hierarchical-based clustering. The algorithm generates a dendrogram called the CF-tree (clustering feature tree).
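The clustering feature that gives the CF-tree its name can be illustrated with a short sketch. It assumes the usual formulation of a clustering feature as a triple (N, LS, SS): the number of points, their linear sum, and their sum of squared norms. The class and function names, and the use of the radius (rather than the diameter) in the threshold test, are assumptions made for the example; this is not the pseudo-code of [13].

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class CF:
    n: int      # number of points summarized
    ls: tuple   # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def of(cls, point):
        """Clustering feature of a single point."""
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        """CF entries are additive, so merging needs no access to the raw points."""
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the summarized points to the centroid."""
        c2 = sum(x * x for x in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))


def try_absorb(entry, point, threshold):
    """Return the merged CF if its radius stays within the threshold, else None."""
    merged = entry.merge(CF.of(point))
    return merged if merged.radius() <= threshold else None


entry = CF.of((0.0, 0.0)).merge(CF.of((2.0, 0.0)))
near = try_absorb(entry, (1.0, 0.0), threshold=1.0)   # fits within the threshold
far = try_absorb(entry, (10.0, 0.0), threshold=1.0)   # would exceed the threshold
```

Because the entries are additive, a tree node can summarize a whole subcluster without storing its points, which is what allows the tree to be built incrementally in memory.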
Steps of the BIRCH algorithm: BIRCH scans the dataset incrementally to build the CF tree. The processing is executed in two main phases: first the database is scanned to build an in-memory tree, and then a clustering algorithm is applied to the leaf nodes. The CF tree is a height-balanced tree with two parameters, the branching factor (B) and the threshold (T). The CF tree is constructed while the dataset is scanned: for each data item, the tree is traversed from the root node, selecting the closest node at each level. Once the closest node at each level has been identified, a test is performed on the candidate cluster to decide whether the data item belongs to it.

BIRCH can typically discover a good clustering with a single scan of the dataset and improve the quality further with a few additional scans. It can also handle noise effectively. However, BIRCH may not work well when clusters are not spherical, because it uses the concept of radius or diameter to control the boundary of a cluster. In addition, it is order-sensitive and may generate different clusters for different orders of the same input data. The details of the algorithm are given in the figure below.

FIGURE [3]: BIRCH ALGORITHM PSEUDO-CODE. [13]

5.2 DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method. The algorithm grows regions of sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. The key idea of density-based clustering is that, for each object of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of objects (MinPts); that is, the cardinality of the neighborhood has to exceed a threshold.

DENCLUE is another example of a density-based clustering algorithm. It analytically models the cluster distribution as the sum of the influence functions of all of the data points, where an influence function describes the impact of a data point within its neighborhood. Clusters are formed around density attractors, which are the local maxima of the overall density function. With this approach, clusters of arbitrary shape can be described by a simple equation based on kernel density functions. DENCLUE requires a careful selection of its input parameters (i.e. σ and ξ), as these parameters play an important role in cluster formation and in the quality of the output. Nevertheless, it has several advantages in comparison to other clustering algorithms:
a) it has a solid mathematical foundation and generalizes other clustering methods, such as partitioning and hierarchical methods;
b) it has good clustering properties for datasets with a large amount of noise;
c) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional datasets; and
d) it uses grid cells and only keeps information about the cells that actually contain points, managing these cells in a tree-based access structure.
All of these properties make DENCLUE able to produce good clusters in datasets with a large amount of noise. The details of this algorithm are given in the pseudo-code figure that follows.
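Separately from the pseudo-code of [13], the density model behind DENCLUE can be illustrated with a small Python sketch. It assumes a Gaussian influence function with width σ (`sigma`) and treats ξ (`xi`) as the minimum density an attractor must reach for its points not to be labeled noise. The hill climbing is done here with a mean-shift style weighted-mean update, which is a substitution chosen for brevity; the function names are hypothetical, and the grid cells and tree-based access structure are not modeled.

```python
from math import exp


def influence(x, y, sigma):
    """Gaussian influence of data point y at location x."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-d2 / (2 * sigma ** 2))


def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(influence(x, y, sigma) for y in data)


def attractor(x, data, sigma, steps=100, tol=1e-6):
    """Climb from data point x towards a local maximum of the density function."""
    for _ in range(steps):
        weights = [influence(x, y, sigma) for y in data]
        total = sum(weights)
        # move to the influence-weighted mean of the data (mean-shift style step)
        new_x = tuple(
            sum(w * y[i] for w, y in zip(weights, data)) / total
            for i in range(len(x))
        )
        if sum((a - b) ** 2 for a, b in zip(new_x, x)) < tol ** 2:
            return new_x
        x = new_x
    return x


def label(data, sigma, xi):
    """Map each point to its attractor, or to None when the attractor is too sparse."""
    result = {}
    for p in data:
        a = attractor(p, data, sigma)
        result[p] = a if density(a, data, sigma) >= xi else None
    return result


data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (20.0, 20.0)]
labels = label(data, sigma=0.5, xi=1.5)
```

Points whose attractors (approximately) coincide belong to the same cluster, while the isolated point (20, 20) has an attractor density below ξ and is treated as noise. Changing `sigma` or `xi` changes which points are grouped, which is one way to see why the text stresses careful parameter selection.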

FIGURE [3]: DENCLUE ALGORITHM PSEUDO-CODE. [13]

GMDBSCAN
GMDBSCAN is a density- and grid-based algorithm that can work on high-dimensional datasets and develop clusters of any arbitrary shape. The algorithm is known as grid-based multi-density clustering with noise. [14]

Steps of the GMDBSCAN algorithm:
a) Take the input data set and standardize it.
b) Divide the data space into a grid.
c) Compute the density statistics of the grid and construct the SP-tree.
d) Form the bitmap.
e) Cluster locally and merge the results into similar sub-clusters.
f) Eliminate noise and process the border points.
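Steps a) to c) above (standardizing the data, dividing the space into a grid, and collecting density statistics per cell) can be sketched as follows. This is only an illustration under assumptions: min-max scaling stands in for the standardization, the grid is uniform with a hypothetical `cells_per_dim` parameter, and a plain counter is used in place of the SP-tree. The bitmap, merging, and noise-handling steps of [14] are not shown.

```python
from collections import Counter


def standardize(points):
    """Min-max scale every dimension to the range [0, 1] (assumed standardization)."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return [
        tuple(
            (p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
            for d in dims
        )
        for p in points
    ]


def cell_of(point, cells_per_dim):
    """Index of the grid cell that contains a standardized point."""
    # the upper boundary value 1.0 is placed in the last cell
    return tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in point)


def grid_density(points, cells_per_dim):
    """Count the points falling into each non-empty grid cell."""
    return Counter(cell_of(p, cells_per_dim) for p in standardize(points))


points = [(1, 1), (2, 1), (1, 2), (9, 9), (10, 10), (10, 1)]
counts = grid_density(points, cells_per_dim=2)
```

Only non-empty cells appear in `counts`, so the later clustering steps can work on the cell statistics rather than on the individual points.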

The figure below gives the details of the GMDBSCAN algorithm.

FIGURE [3]: GMDBSCAN ALGORITHM PSEUDO-CODE. [14]

CONCLUSION
In this paper we have presented a comprehensive study of Big Data and its categorization on the basis of data accessibility, and we have defined the challenges that big data faces today in storing, sorting, and analyzing. We have described different types of clustering algorithms. As future work, we suggest investigating different types of data clustering algorithms and their implementation for big data, and optimizing and measuring the efficiency of the algorithms that are suitable for handling massive Big Data and applicable to multi-dimensional data sets.

REFERENCES
[1] Seref Sagiroglu and Duygu Sinanc, Gazi University. Big Data: A Review.
[2] Marko Grobelnik (marko.grobelnik@ijs.si), Jozef Stefan Institute, Ljubljana, Slovenia. Big Data Tutorial, May 8th 2012.
[3] images
[4] last access 29th January 2015

[5] last access 29th January 2015
[6] A. A. Abbasi and M. Younis. A survey on clustering algorithms for wireless sensor networks. Computer Communications, 30(14).
[7] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data, Springer, 2012.
[8] Ku Ruhana Ku-Mahamud (ruhana@uum.edu.my), Universiti Utara Malaysia, Malaysia. Big Data Clustering Using Grid Computing and Ant-Based Algorithm.
[9] Seref Sagiroglu and Duygu Sinanc (ss@gazi.edu.tr, duygusinanc@gazi.edu.tr), Gazi University, Department of Computer Engineering, Faculty of Engineering, Ankara, Turkey. Big Data, A survey.
[10] Jerzy Stefanowski, Institute of Computing Sciences, Poznan University of Technology, Poznan, Poland. Data Mining clustering techniques lectures. Lecture 7, SE Master Course 2008/2009.
[11] Xiao Cai, Feiping Nie, Heng Huang (xiao.cai@mavs.uta.edu, feipingnie@gmail.com, heng@uta.edu), University of Texas at Arlington, Arlington, Texas. Multi-View K-means Clustering on Big Data.
[12] Wei Fan (david.fanwei@huawei.com), Huawei Noah's Ark Lab, Hong Kong Science Park, Shatin, Hong Kong, and Albert Bifet (abifet@yahoo-inc.com), Yahoo! Research Barcelona, Av. Diagonal 177, Barcelona, Catalonia, Spain. Mining Big Data: Current Status, and Forecast to the Future.
[13] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Zomaya, S. Foufou, and A. Bouras. A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis.
[14] C. Xiaoyun, M. Yufang, Z. Yan, and W. Ping, School of Information Science and Engineering, Lanzhou University. GMDBSCAN: Multi-Density DBSCAN Cluster Based on Grid.