Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data


Jayalatchumy D.¹, Thambidurai P.¹, Department of CSE, PKIET, Karaikal, India

Abstract - Clustering groups objects that are similar among themselves but dissimilar to objects in other clusters. Clustering a large dataset is a challenging task, and the need for greater scalability and performance motivates the use of parallelism. Though the use of Big Data has become essential, analyzing it is demanding. This paper presents pc-PIC, a parallel client-based Power Iteration Clustering algorithm derived from parallel PIC (p-PIC), which in turn originates from PIC (Power Iteration Clustering). PIC performs clustering by embedding the data points in a low-dimensional space derived from the similarity matrix. The proposed client-based algorithm pc-PIC offloads work otherwise done by the server and reduces its execution time. The experimental results show that pc-PIC performs well for big data: it is fast and scalable, and the accuracy of the clusters it produces is almost the same as that of the original algorithm. Hence the results produced by pc-PIC are fast, scalable and accurate. Keywords: PIC, p-PIC, pc-PIC, Big Data, Clustering.

1 Introduction

Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. Survey papers, e.g., [1][2], provide a good reference on clustering methods. Sequential clustering algorithms work well for data sizes up to a few thousand records. However, data sizes have grown very fast over the past decade due to the rapid improvement of information observation technology. IBM characterizes big data by the properties volume, velocity and variety; big data techniques address data that does not fit into a conventional relational database. How to efficiently process and analyze such data has become a major research issue in recent years.
One common strategy to handle the problem is to parallelize the algorithms and execute them, along with the input data, on high-performance computers. Compared with many other clustering approaches, the major advantage of the graph-based approach is that users do not need to know the number of clusters in advance; it requires neither labeled data nor an assumed number of clusters. Its major problem is that it requires large memory space and computational time to compute the graph structure [5]. The limitation comes from computing the similarity values of all pairs of data nodes. Moreover, all the pairs must be sorted, or partially sorted, since constructing the graph structure requires retrieving the most similar pair of data nodes. This step is logically sequential and thus hard to parallelize. Unfortunately, it is necessary, and it takes most of the computational time and memory space of clustering. Therefore, parallelizing the graph-based approach is very challenging. One popular modern clustering algorithm is spectral clustering, a family of methods based on eigendecompositions of affinity, dissimilarity or kernel matrices [5][7]. PIC replaces the eigendecomposition needed by spectral clustering with matrix-vector multiplications, which reduces the computational complexity. Clustering experiments on several datasets have shown that PIC [5] is not only accurate but also fast. The PIC algorithm can handle large data sets, but fitting the similarity matrix into a single computer's memory is

not feasible. For these reasons we move to parallelism across different machines. Owing to its efficiency and performance for data communication in distributed cluster environments, MPI was used as the programming model for implementing the parallel PIC algorithm.

2 Power Iteration Clustering

Spectral clustering has advantages over conventional algorithms such as k-means and hierarchical clustering, but computing eigenvectors is time consuming [5]: the effort required is relatively high, O(n³), where n is the number of data points. PIC is instead designed to find a pseudo-eigenvector, which overcomes this limitation; PIC [9] is not only simple but also scalable, with time complexity O(n) [5]. A pseudo-eigenvector is not one of the eigenvectors but is created linearly from them. Therefore, if two data points lie in different clusters, their values in the pseudo-eigenvector can still be separated.

Given a dataset X = (x_1, ..., x_n), a similarity function s(x_i, x_j) is a function with s(x_i, x_j) = s(x_j, x_i), s >= 0 if i != j, and s = 0 if i = j. An affinity matrix A ∈ R^{n×n} is defined by A_ij = s(x_i, x_j). The degree matrix D associated with A is the diagonal matrix with d_ii = Σ_j A_ij. A normalized affinity matrix W is defined as D^{-1}A. The second-smallest, third-smallest, ..., k-th smallest eigenvectors of the Laplacian L are often well suited for clustering the graph W into k components [10]. The main steps of the Power Iteration Clustering algorithm are as follows [9][5]:
1) Calculate the similarity matrix of the given graph and form the affinity matrix A ∈ R^{n×n}.
2) Normalize the affinity matrix, W = D^{-1}A.
3) Generate the initial vector V_0.
4) Perform the iterative matrix-vector multiplication to obtain V_{t+1}.
5) Cluster on the final vector obtained.
6) Output the clustered vectors.

Input: a data set X = {x_1, x_2, ...,
x_n} and the normalized affinity matrix W
// Affinity matrix calculation and normalization
1. Construct the affinity matrix A ∈ R^{n×n} from the given graph.
2. Normalize the affinity matrix by dividing each element by its row sum: W = D^{-1}A.
// Generate the initial vector
3. Generate the initial vector V_0 = R/|R|_1, where R is the vector of row sums of A.
// Iterative matrix-vector multiplication
repeat
    V_{t+1} = (W V_t) / |W V_t|_1
    δ_{t+1} = |V_{t+1} - V_t|
    acceleration = |δ_{t+1} - δ_t|
    increase t
until the stopping criterion is met
// Clustering
4. Cluster on the final vector V_t.
5. Output the clusters.
Fig 1: Pseudocode for PIC

[A worked numerical example follows in the source: a 3×3 affinity matrix is normalized to W = D^{-1}A, the row-sum vector R and the initial vector V_0 = R/|R|_1 are computed, and the first iterate V_1 = W V_0 / |W V_0|_1 is obtained; the numeric values are illegible in the source.]
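The pseudocode in Fig 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the RBF similarity, the stopping threshold, and the final largest-gap split of the one-dimensional embedding (in place of k-means on the final vector) are assumptions for demonstration, not the paper's exact choices.

```python
# Minimal sketch of Power Iteration Clustering (PIC), following Fig 1.
# RBF similarity, eps, and the 2-way gap split are illustrative assumptions.
import numpy as np

def pic(points, eps=1e-6, max_iter=1000):
    n = len(points)
    # Affinity matrix A with A_ij = s(x_i, x_j) and zero diagonal.
    diff = points[:, None, :] - points[None, :, :]
    A = np.exp(-np.sum(diff ** 2, axis=2))
    np.fill_diagonal(A, 0.0)
    # W = D^{-1} A: divide each row by its row sum.
    R = A.sum(axis=1)
    W = A / R[:, None]
    # Initial vector V_0 = R / |R|_1.
    v = R / np.abs(R).sum()
    delta_prev = np.full(n, np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()          # L1 normalization
        delta = np.abs(v_new - v)
        v = v_new
        # Stop when the "acceleration" |delta_{t+1} - delta_t| is tiny.
        if np.abs(delta - delta_prev).max() < eps:
            break
        delta_prev = delta
    # Two-cluster split of the 1-D embedding at its largest gap
    # (the original algorithm runs k-means on the final vector).
    order = np.argsort(v)
    cut = int(np.argmax(np.diff(v[order])))
    labels = np.zeros(n, dtype=int)
    labels[order[cut + 1:]] = 1
    return labels, v
```

For two well-separated groups of points, the labels recover the grouping, illustrating how the pseudo-eigenvector keeps points from different clusters at separated values.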

After the necessary calculations, substituting V_1 and V_2 yields zero, hence we conclude that the number of clusters produced is two. The experimental results of the PIC implementation, for various inputs and the clusters generated for them, are shown as graphs in the figure below.

INPUT: [3 1 6; 8 6 1; 6 3 11] and [1 1 3; 6 5 6; 7 8 9]
[Plots of the clustered outputs for the sample inputs; axis values are illegible in the source.]
Fig 2: The graph for various inputs

3 Parallel PIC (p-PIC)

Parallelization is a method to improve performance and achieve scalability. Many techniques have been used to distribute the load over multiple processors, and several different parallel programming frameworks are available [12]. The Message Passing Interface (MPI) is a message-passing library interface for communication in parallel programming environments [12]. Because of its efficiency and performance in a distributed environment, MPI [5] was used as the programming model to implement the parallel PIC algorithm. The algorithm for parallel PIC is as follows [5], and the flowchart is shown in Fig 3.
Step 1: Get the starting and ending indices of cases from the master processor.
Step 2: Read in the chunk of case data and receive a case broadcast from the master.
Step 3: Calculate the similarity sub-matrix A_i, an n/p by n matrix.
Step 4: Calculate the row sum R_i of the sub-matrix and send it to the master.
Step 5: Normalize the sub-matrix by the row sum, W_i = D_i^{-1} A_i.
Step 6: Receive the current vector from the master, v_{t-1}.
Step 7: Obtain the sub-vector by performing the matrix-vector multiplication v_i^t = W_i v_{t-1}.
Step 8: Send the sub-vector v_i^t to the master.
Fig 3: Flowchart for p-PIC using MPI

4 Client Based p-PIC

Transferring the data from the server to the client accounts for much of the execution time. Since the vector has to be calculated and sent to the master at each iteration, it consumes more time.
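For comparison with the client-based variant, the eight p-PIC worker steps listed above can be sketched by simulating the MPI ranks with plain loops (the paper's implementation uses actual MPI processes [5]); the dataset, similarity function, and iteration count here are illustrative assumptions.

```python
# Sketch of the p-PIC row partitioning (Steps 1-8), with the MPI
# master/worker exchange simulated by Python loops over "workers".
import numpy as np

def ppic_iteration_sim(points, n_workers=2, n_iter=50):
    n = len(points)
    # Step 1: the master assigns contiguous index ranges to the p workers.
    bounds = np.array_split(np.arange(n), n_workers)
    # Steps 2-3: each worker builds its similarity sub-matrix A_i (n/p by n);
    # here we build the full matrix once and slice it, for brevity.
    diff = points[:, None, :] - points[None, :, :]
    full_A = np.exp(-np.sum(diff ** 2, axis=2))
    np.fill_diagonal(full_A, 0.0)
    subs = [full_A[idx, :] for idx in bounds]
    # Step 4: workers send their row sums to the master.
    row_sums = np.concatenate([A_i.sum(axis=1) for A_i in subs])
    # Step 5: workers normalize locally, W_i = D_i^{-1} A_i.
    Ws = [A_i / row_sums[idx][:, None] for A_i, idx in zip(subs, bounds)]
    # Master forms the initial vector v_0 = R / |R|_1 and broadcasts it.
    v = row_sums / np.abs(row_sums).sum()
    for _ in range(n_iter):
        # Steps 6-8: workers compute v_i = W_i v and return sub-vectors;
        # the master concatenates and renormalizes.
        v = np.concatenate([W_i @ v for W_i in Ws])
        v /= np.abs(v).sum()
    return v
```

Because the sub-matrices are row slices of W, concatenating the per-worker products reproduces the serial iterate W v exactly; only communication, not arithmetic, changes.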
To reduce this processing time, the algorithm is designed so that the client takes responsibility for much of the work, reducing the work of the server. The client-based power iteration clustering algorithm is as follows. The master receives the data from the dataset and splits it based on the number of slaves (n). On receiving its data from the master, each slave starts its computation: it receives the data file, finds the row sum, and sends the row sum back to the master. The master then finds the initial vector, which is sent to the slaves. The calculation of the vector and of the number of clusters proceeds iteratively, and the process ends when the stopping criterion is met. The architecture of the parallel client-based PIC is given below in Fig 4.
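The division of labour just described can be sketched as follows. The key difference from the p-PIC simulation is that each slave evaluates the stop criterion on its own slice, so the master only combines per-slave results; the affinity matrix, threshold, and function names are illustrative assumptions, not the paper's code.

```python
# Sketch of the pc-PIC master/slave flow: the master splits data and
# aggregates, while each slave computes row sums, its part of the
# matrix-vector product, and its share of the stopping check.
import numpy as np

def pc_pic_sim(A, n_slaves=2, eps=1e-9, max_iter=500):
    n = A.shape[0]
    parts = np.array_split(np.arange(n), n_slaves)   # master splits the data
    # Slaves compute row sums of their slices and return them to the master.
    row_sums = np.concatenate([A[idx].sum(axis=1) for idx in parts])
    # Slaves normalize locally: W_i = D_i^{-1} A_i.
    Ws = [A[idx] / row_sums[idx][:, None] for idx in parts]
    # Master forms the initial vector v_0 = R / |R|_1 and broadcasts it.
    v = row_sums / np.abs(row_sums).sum()
    for t in range(max_iter):
        new = np.concatenate([W @ v for W in Ws])    # slaves: sub-vectors
        new /= np.abs(new).sum()
        # Each slave checks convergence on its own slice; the master only
        # combines the per-slave booleans instead of recomputing globally.
        done = all(np.abs(new[idx] - v[idx]).max() < eps for idx in parts)
        v = new
        if done:
            break
    return v, t
```

In an actual deployment each list comprehension above would be a message exchange; keeping the stop-criterion arithmetic on the slaves is what trims the server's share of the work.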

Fig 4: Working of the pc-PIC algorithm (design of the proposed work). Data sources (social feeds, images, email, market research) feed the big data engine's data node. The master receives the data file, splits it, and sends the parts to the slaves; each slave finds the row sum of its part and sends it to the master; the master computes the overall initial vector and sends it to the slaves; each slave computes its sub-vector, evaluates the stop criterion, and sends the result to the master. Once the stop criterion is met, the master receives the clusters from the slaves; after receiving the clusters from all slaves, the master stops.

5 Experimental Results

The effectiveness of the original PIC for clustering has been discussed by Lin and Cohen [9], and the scalability of p-PIC has been shown by Weizhong Yan [5]. This paper focuses on the scalability of the parallel implementation of the pc-PIC algorithm. We implemented our algorithm over a number of synthetic datasets of many records; we also created a data generator to produce the datasets used in our experiments, with the size (n) of the dataset varied over a range of row counts. We performed the experiments on a local cluster of 6 HP Intel-based nodes. The stopping criterion ensures that the numbers of clusters created are approximately equal. In this paper we use speedup (execution time) as the performance measure for pc-PIC.

6 Comparison of PIC, p-PIC and pc-PIC

We present the performance of pc-PIC in terms of speedup on different numbers of processors, and examine the scalability of the algorithm with different database sizes. We compare p-PIC with pc-PIC and show that performance increases along with scalability. Fig 5 and Fig 6 show the time taken for various dataset sizes; data sizes are measured in MB and times in milliseconds. The graphs compare PIC, p-PIC and p-PIC in the MapReduce framework. Fig 7 gives the comparison of the various datasets and their corresponding execution times.
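A synthetic data generator of the kind described in the experimental setup can be sketched as below; the cluster count, spread, and value ranges are arbitrary illustrative choices, not the paper's generator.

```python
# Sketch of a synthetic dataset generator: points drawn around a few
# random cluster centers, with the true label kept for checking results.
import numpy as np

def generate_dataset(n_rows, n_clusters=3, n_features=2, spread=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Random cluster centers in a fixed box (illustrative range).
    centers = rng.uniform(-10.0, 10.0, size=(n_clusters, n_features))
    # Assign each row to a cluster, then add Gaussian noise around its center.
    labels = rng.integers(0, n_clusters, size=n_rows)
    points = centers[labels] + rng.normal(0.0, spread, size=(n_rows, n_features))
    return points, labels
```

Varying `n_rows` over several orders of magnitude reproduces the kind of size sweep used to measure execution time.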
Fig 5: Comparison of speedup for PIC, p-PIC and p-PIC in MapReduce
Fig 6: Data size (MB) vs execution time (ms)
Fig 7: Comparison of PIC, p-PIC and pc-PIC: a table of execution times (ms) for datasets of increasing size (KB). [The tabulated timing values are illegible in the source.]
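Speedup, the performance measure used above, is the ratio of sequential to parallel execution time. A minimal sketch with hypothetical timings (not the paper's measurements, which did not survive extraction):

```python
# Speedup = T_sequential / T_parallel; timings below are hypothetical.
def speedup(t_sequential_ms, t_parallel_ms):
    return t_sequential_ms / t_parallel_ms

# e.g. a run taking 5000 ms sequentially and 1250 ms in parallel:
print(speedup(5000, 1250))  # -> 4.0
```

Plotting this ratio against the number of processors, as in Fig 5, shows how far the implementation is from the ideal linear speedup.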

7 Conclusion

In this paper we have designed a new client-based algorithm for PIC, namely pc-PIC, and generated clusters for datasets of various sizes. The results have been compared with sequential PIC and with p-PIC using MPI. They show that the clusters formed by pc-PIC are almost the same as those of the other algorithms, while performance is improved to a great extent by reducing the execution time. It has also been observed that the performance gain grows with the data size; hence pc-PIC is more efficient for high-dimensional datasets.

8 Future Work

Detecting a failing node that can crash the entire system is necessary; the aim of a fault-tolerance system is to remove nodes that cause failures [8]. Hadoop is fault tolerant and provides a mechanism to recover from such failures. As future work we can address how node failures can be handled using MapReduce and compare this with other frameworks.

9 References

[1] A. Jain, M. Murty, P. Flynn, "Data clustering: A review," ACM Computing Surveys 31(3) (1999) 264-323.
[2] R. Xu, D. Wunsch, "Survey of clustering algorithms," IEEE Transactions on Neural Networks 16(3) (2005).
[3] Kyuseok Shim, "MapReduce algorithms for big data analysis," Proceedings of the VLDB Endowment 5(12) (2012).
[4] W.Y. Chen, Y. Song, H. Bai, C. Lin, "Parallel spectral clustering in distributed systems," IEEE Transactions on Pattern Analysis and Machine Intelligence 33(3) (2011) 568-586.
[5] Weizhong Yan et al., "p-PIC: Parallel power iteration clustering for big data," Journal of Parallel and Distributed Computing 73(3) (March 2013), special issue on Models and Algorithms for High-Performance Distributed Data Mining.
[6] W. Zhao, H. Ma, Q. He, "Parallel k-means clustering based on MapReduce," Lecture Notes in Computer Science 5931 (2009) 674-679.
[7] H. Gao, J. Jiang, L. She, Y. Fu, "A new agglomerative hierarchical clustering algorithm implementation based on the MapReduce framework," International Journal of Digital Content Technology and its Applications 4(3) (2010) 95-100.
[8] Kyuseok Shim, "MapReduce algorithms for big data analysis," Proceedings of the VLDB Endowment 5(12) (2012).
[9] Frank Lin, William W. Cohen, "Power iteration clustering," International Conference on Machine Learning, Haifa, Israel, 2010.
[10] F. Lin, W.W. Cohen, "Power iteration clustering," in: Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
[11] T. Sakai, A. Imiya, "Fast spectral clustering with random projection and sampling," Lecture Notes in Computer Science 5632 (2009).
[12] M.J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, Boston, Mass., USA.