Analysis of Deflated Power Iteration Clustering on Hadoop Cluster
I J C T A, 9(3), 2016, International Science Press

Analysis of Deflated Power Iteration Clustering on Hadoop Cluster

Jayalatchumy Dhanapal*, Thambidurai Perumal**

Abstract: Big Data has to be stored and processed efficiently to extract knowledge and information from it. Scalability and performance become important issues when dealing with large-scale, data-intensive computing, and there is also a need to address failures and improve the effectiveness of existing machine learning algorithms. MapReduce is a simplified programming model for data-intensive distributed parallel computing, which motivates us to redesign existing sequential algorithms as MapReduce algorithms. One such unsupervised sequential algorithm is the Deflated Power Iteration Clustering (DPIC) algorithm, which uses the Schur complement deflation technique to compute multiple orthogonal pseudo-eigenvectors that are finally clustered using k-means. This paper discusses the implementation of the DPIC algorithm in a distributed environment using Apache Hadoop. The experimental results show that the DPIC algorithm achieves better performance in the MapReduce framework when handling larger datasets, and that it is effective and accurate.

Keywords: MapReduce, Big Data, Hadoop, Power Iteration, Performance, Deflation.

1. INTRODUCTION

Data volume is scaling faster than computing resources, so managing large datasets is a challenging task: the larger the dataset, the longer the computation takes. Despite the algorithmic improvements proposed for many serial algorithms, they cannot handle large-scale problems, which further slows down the process [11][12]. There is therefore a growing need to develop efficient parallel data mining algorithms that can run on distributed systems, leading to powerful platforms. Distributed computing is a technique for solving computational problems by sharing the computation over a network. The first such attempt was made by Google,
e.g., Hadoop MapReduce. Numerous notable attempts have been made to exploit massively parallel processing architectures, as reported in [9]. Cluster analysis is among the most important data mining techniques. Though traditional clustering methods are still popular, they suffer from the curse of dimensionality [26]. Motivated by the need for parallelism and effectiveness, this paper implements DPIC on MapReduce. While the main effort of this paper is on parallelization strategies and implementation details for minimizing computation and communication costs, additional work has been done on the MapReduce framework. The rest of the paper is organized as follows. Section II gives a literature review of various clustering methodologies. Section III discusses the deflated PIC algorithm, followed by the implementation of the DPIC algorithm in Section IV. Sections V and VI present the mapper/reducer algorithms and the experimental results. Finally, Section VII concludes the work and gives future directions.

2. LITERATURE REVIEW

There have been many studies on various clustering methodologies [6]. The computational time and memory space required by spectral clustering algorithms are very large because of the similarity-measure calculations [1][2]. Fowlkes et al. [13] proposed the Nystrom method to avoid the calculation of

* Department of CSE/PKIET, Pondicherry University, India, [email protected]
** Department of CSE/PKIET, Pondicherry University, India, [email protected]
similarity matrices. Dhillon et al. [14] proposed a method that does not use eigenvectors. The advantage of these algorithms is that they reduce the computational time, but the accuracy ratio and memory usage have not been addressed. To overcome these drawbacks, Lin and Cohen [5] proposed a clustering algorithm based on power iteration, which replaces the eigenvalue decomposition by matrix-vector multiplication. In terms of parallel frameworks, many algorithms have been implemented in MPI, a message passing interface [4][15]. Yang [28] used MPI to distribute data. MPI is usually used to exploit parallelism across nodes, but it increases the communication between machines on the network and is bound in performance when working with large datasets. Weizhong Yan et al. [4] chose MPI as the programming model for implementing the parallel PIC algorithm, since it is efficient and performs well for communication in distributed clusters [27]. More recently, MapReduce [17], a Google parallel computing framework, and its Java implementation Hadoop have been increasingly used [25]. Weizhong Zhao et al. proposed a parallel k-means algorithm based on MapReduce that scales well and processes large datasets efficiently on commodity hardware [10][21]. Qing Liao et al. [22] proposed an improved k-means based on MapReduce, which improves on traditional versions by decreasing the number of iterations and accelerating the processing speed per iteration. To overcome the deficiencies of existing algorithms for parallel matrix multiplication, Jian-Hua Zheng et al. [20] presented a processing scheme based on vector linear combination (VLC). Xiaoli Cui et al. [23] proposed a novel processing model in MapReduce that obtains high performance by eliminating the iteration dependency. MapReduce takes advantage of local storage to avoid these issues when working with large datasets [16].
Hadoop supports fault tolerance, and hence in this paper the implementation is carried out on the parallel MapReduce framework.

3. DEFLATED POWER ITERATIVE CLUSTERING

Spectral clustering is an active research area in machine learning. Deflated Power Iterative Clustering is a spectral clustering algorithm that performs well compared to traditional clustering algorithms such as k-means. The standard method for computing the largest eigenvector is power iteration. In this method the vector is initialized randomly and denoted v^t. The power method is given by v^t = cWv^(t-1), where W is the matrix and c is a normalizing constant that keeps the vector from becoming too large or too small [7]. The speed of computing the first pseudo-eigenvector is the same as for every subsequent pseudo-eigenvector. Anh Pham The et al. proposed a sequential method to find k mutually orthogonal pseudo-eigenvectors, where k is the number of clusters. Instead of finding one pseudo-eigenvector as in Power Iteration Clustering, Deflated Power Iteration Clustering (DPIC) finds additional pseudo-eigenvectors, each containing additional useful information for clustering. Deflation is a technique to manipulate the system: after finding the largest eigenvalue, its effect is displaced in such a way that the next larger eigenvalue becomes the largest in the system, and the power method is applied again. Wielandt deflation is a single-vector technique. Schur decomposition builds on the idea that if one vector of unit 2-norm is known, it can be completed by (n-1) additional vectors to form an orthogonal basis of C^n. Various deflation techniques have been studied, such as Wielandt deflation, Hotelling deflation, projection deflation, and Schur complement deflation. Of all these techniques, the only one found to preserve the orthogonality of the pseudo-eigenvectors is Schur complement deflation.
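As an illustrative sketch only (not the authors' code), the power method v^t = cWv^(t-1) described above can be written as follows; W is assumed to be a row-normalized affinity matrix, and all names are ours:

```python
import numpy as np

def power_iteration(W, tol=1e-6, max_iter=1000):
    """Repeatedly apply W to a random vector, L1-normalizing each
    step (the constant c), until successive iterates change by
    less than tol."""
    n = W.shape[0]
    rng = np.random.default_rng(0)
    v = rng.random(n)
    v /= np.abs(v).sum()
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()   # c keeps the vector from growing or shrinking
        if np.abs(v_new - v).max() < tol:
            break
        v = v_new
    return v_new
```

Note that PIC in practice stops this iteration early, before full convergence, since the partially converged vector still separates the clusters.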
In DPIC, the original affinity matrix W_0 has eigenvectors (x_1, x_2, ..., x_n) with eigenvalues (λ_1, λ_2, ..., λ_n). First, from the original affinity matrix W_0, power iteration is used as in PIC to find the first pseudo-eigenvector v_1. The algorithm for DPIC is shown in Figure 1. On applying deflation, the effect of v_1 is eliminated from W_0 to obtain the second affinity matrix W_1. From W_1, power iteration is applied again to obtain the second pseudo-eigenvector v_2. The effect of v_2 on W_1 is eliminated again using Schur deflation. This process is repeated k times. Owing to the multiple pseudo-eigenvector computations, DPIC requires more execution time than PIC; this observation motivated implementing the DPIC algorithm using MapReduce.

Input: A dataset X = {x_1, ..., x_n}
Step 1. Read the input and construct the similarity matrix W.
Step 2. Normalize the obtained matrix by dividing each element by its row sum.
Step 3. Compute the initial vector v^0.
Step 4. Calculate v^t = cWv^(t-1) until the stopping criterion is met.
Step 5. Find the l-th vector and remove its effect on W_(l-1) by Schur complement deflation, W_l = W_(l-1) - (W_(l-1) v_l v_l^T W_(l-1)) / (v_l^T W_(l-1) v_l).
Step 6. Increment l and iterate until the condition is satisfied.
Step 7. Use k-means to cluster the points v^t.
Output: Clusters C_1, C_2, ..., C_k

Figure 1: Algorithm for DPIC

4. MAPREDUCE OPERATING PRINCIPLE

Google introduced MapReduce, a technique to analyze, manipulate and visualize Big Data [18][19]; Hadoop is a software framework that consists of the Hadoop kernel, MapReduce, HDFS and other components [8]. MapReduce has two phases, Map and Reduce. Users specify a map function that processes a <Key, Value> pair to generate a set of intermediate <Key, Value> pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. During processing, HDFS splits the huge data into smaller blocks, called chunks, which are 64 MB by default.
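To make the DPIC loop of Figure 1 concrete, here is a minimal sketch of our own (not the authors' implementation), using Schur complement deflation in the form W_l = W_(l-1) - (W_(l-1) v v^T W_(l-1)) / (v^T W_(l-1) v), which is our reading of the garbled Step 5; the final k-means step is omitted and all names are illustrative:

```python
import numpy as np

def dpic_pseudo_eigenvectors(W, k, iters=100):
    """Compute k pseudo-eigenvectors of the normalized affinity
    matrix W: power iteration, then Schur complement deflation to
    remove each vector's effect before the next one is found."""
    W = W.astype(float).copy()
    n = W.shape[0]
    rng = np.random.default_rng(0)
    vectors = []
    for _ in range(k):
        v = rng.random(n)
        for _ in range(iters):             # power iteration on the current matrix
            v = W @ v
            v /= np.abs(v).sum() + 1e-12   # normalizing constant c
        Wv = W @ v
        denom = v @ Wv + 1e-12             # guard against a vanishing denominator
        W = W - np.outer(Wv, v @ W) / denom  # Schur complement deflation
        vectors.append(v)
    return np.column_stack(vectors)        # n x k matrix; rows go to k-means
```

Each data point is then represented by a row of the returned n x k matrix and clustered with k-means, as in Step 7 of Figure 1.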
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files [13][24]. The data is split into <Key, Value> pairs, and the map function is invoked for every pair. The output is stored in a buffer; when the buffer reaches its threshold size, the contents are partitioned and written to disk. An in-memory sort is performed and the results are passed to the combiner, after which the files are merged into a single file and compressed accordingly. As soon as a map completes, its output is copied to the buffer of the reduce task; when the threshold level is reached, the outputs are merged and written to disk [8]. The map side performs the sorting, and the output is copied and merged. During the reduce phase, the reduce function is called for each key of the sorted output, and the result is written to HDFS. The execution model of MapReduce is shown in Figure 2.

5. DPIC IN MAPREDUCE

The first step is to design the MapReduce functions to handle input/output. The input is given as a <Key, Value> pair, where Key is the cluster and Value is the vector in the dataset. The datasets are stored in the
input directory of HDFS, and this forms the <Key> field in the <Key, Value> pair. The commands needed to compute the similarity between the data points are assigned to the mapper function. The mapper is structured so that it computes the similarity and performs the normalization. Once the normalization is complete, the results are passed to the reducer, where the matrix computation is done and the vectors are generated. The effect of each vector is then removed from W, the cluster is updated, and the new vector and velocity are written back to disk, ready for the next iteration. After understanding the mapper and reducer functions, the mapper/reducer algorithms are designed as shown in Figure 3 and Figure 4.

Figure 2: MapReduce Execution Model

Input: Datasets {x_1, ..., x_n}, split into sub-datasets split_1, split_2, ..., split_n. These sub-datasets form the <Key, Value> lists that are the input to the map function.
Output: Normalized similarity matrix.
Procedure DPIC MAP DESIGN
1. Load the cluster datasets {x_1, ..., x_n}
2. Initialize the vector v^0
3. Compute the similarity S(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
4. Construct the similarity matrix A (n x n)
5. Normalize the matrix: A(i, j) = A(i, j) / Σ_k A(i, k)
6. CREATE output record <A>
7. Call DPIC REDUCE
End Procedure

Figure 3: Algorithm for DPIC Mapper Design

Input: Normalized similarity matrix.
Output: Final vectors.
Procedure DPIC REDUCE DESIGN
NEW vectors
COMBINE the resultant normalized sub-matrices from the MAP class
Generate the initial vector v^0, where each entry is the row sum of W
Find v_l from W_(l-1)
Remove the effect of v_l on W_(l-1) using W_l = W_(l-1) - (W_(l-1) v_l v_l^T W_(l-1)) / (v_l^T W_(l-1) v_l)
Calculate the new vector and new velocity v^t
On increment, check for convergence and cluster the results
WRITE the cluster file to the output DIRECTORY
End Procedure

Figure 4: Algorithm for DPIC Reducer Design

6. EXPERIMENTAL RESULTS

In this section the performance of deflated PIC in the MapReduce environment is evaluated with respect to speedup, scalability and efficiency. The experiment was conducted on a cluster of computers, each with two 2.40 GHz cores and 4 GB of memory; Hadoop and Java are used for the MapReduce system. The algorithm is executed on real datasets with 2, 4, 6 and 8 nodes in both a single-machine environment and the MapReduce environment, and the execution times are shown in Figure 5 and Figure 6.

Test of speedup ratio: Speedup is an initial measure for consideration, as it is a key term for assessing execution time and performance improvement. It is used to calculate how many times faster a parallel algorithm runs than the serial algorithm; the greater the speedup ratio, the better the parallel processors are exploited [3]. This is the reason for parallelizing a sequential algorithm. The speedup is calculated using the formula S = Ts / Tp, where Ts is the execution time of the fastest sequential program and Tp is the execution time of the parallel program. If a parallel program is executed on p processors, the highest possible value of S equals the number of processors [3]: in the ideal case every processor needs Ts/p time to complete the job, so S = Ts / (Ts/p) = p.

The speedup results for the various synthetic datasets are shown in Figure 7. It is clearly seen that as the speedup increases, the execution time decreases. In the graph, datasets of different sizes for different numbers of nodes are given along the x-axis, and the execution time in ms is given along the y-axis. Since the execution time of the algorithm decreases, we can conclude that there is an increase in speedup.
From the figure it is seen that the speedup ratio and the performance improve markedly as the data size grows: as the data increases, the speedup also increases, while increasing the number of nodes narrows the speedup [3]. This illustrates that the efficiency of DPIC on the Hadoop platform is high: the execution time of DPIC is lower, and hence the speedup is greater. According to Amdahl's law [3], it is difficult to obtain full speedup on a parallel system with many processors, because each program has a fraction α that cannot be parallelized; only the remaining (1 - α) runs in parallel. So the speedup for the parallel program is

S = Ts / Tp = Ts / (α·Ts + (1 - α)·Ts / P) = P / (1 + α·(P - 1)).

Analysis of scalability: The parallel efficiency of an algorithm is the speedup per processor during parallel program execution. The formula [3] is Efficiency (E) = S / p, where S represents the speedup ratio and p represents the number of processors in the cluster.
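As a worked illustration of the speedup and efficiency formulas above (the numbers are ours, not from the paper's experiments):

```python
def amdahl_speedup(alpha, p):
    """Speedup S = P / (1 + alpha*(P - 1)) for serial fraction
    alpha on p processors (Amdahl's law)."""
    return p / (1 + alpha * (p - 1))

def efficiency(s, p):
    """Parallel efficiency E = S / p."""
    return s / p

# A job whose serial fraction is 10%, run on an 8-node cluster:
s = amdahl_speedup(0.1, 8)   # about 4.7x, not the ideal 8x
e = efficiency(s, 8)         # about 0.59
```

The example shows why speedup narrows as nodes are added: even a small serial fraction caps the achievable speedup, so efficiency drops as p grows.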
Figure 5: Execution time in single-machine environment
Figure 6: Execution time in MapReduce environment
Figure 7: Speedup results for the DPIC algorithm on different datasets with varying nodes
Figure 8: Efficiency curve for the DPIC algorithm on different datasets with varying nodes

Figure 8 shows the efficiency of DPIC for different datasets. From the figure it is seen [3] that as the data size increases, the efficiency is greater, i.e., the algorithm has better scalability. The graph of the relative efficiency of the algorithm for different synthetic datasets from the UCI machine learning repository is shown below.

7. CONCLUSION

In this study we applied MapReduce to the DPIC algorithm and clustered large numbers of data points from varying synthetic datasets. The algorithm shows an improvement in performance, and the computational cost and time complexity are much reduced using MapReduce. It is noticed that increasing the number of nodes in the mapper affects the performance: the more nodes, the better the performance. We can therefore conclude that the MapReduce algorithm is highly scalable and best suited for clustering Big Data. The work has to be validated in real time with increasing dataset sizes, and the performance of the algorithm is to be further investigated in the cloud.

References

[1] C. Fowlkes, S. Belongie, F. Chung, J. Malik, "Spectral grouping using the Nystrom method", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, issue 2, January.
[2] W.Y. Chen, Y. Song, H. Bai, C. Lin, E.Y. Chang, "Parallel spectral clustering in distributed systems", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33.
[3] Felician Alecu, "Performance Analysis of Parallel Algorithms", Journal of Applied Quantitative Methods, vol. 2, issue 1, Spring.
[4] Weizhong Yan et al., "p-PIC: Parallel power iteration clustering for big data", Models and Algorithms for High Performance Distributed Data Mining,
Volume 73.
[5] Frank Lin, William Cohen, "Power Iteration Clustering", International Conference on Machine Learning, Haifa, Israel, June 2010.
[6] Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", Proceedings of the VLDB Endowment, vol. 5, no. 12, VLDB Endowment, 2012.
[7]
[8] Ran Jin, Chunhai Kou, Ruijuan Liu and Yefeng Li, "Efficient Parallel Spectral Clustering Algorithm Design for Large Datasets under Cloud Computing", Journal of Cloud Computing: Advances, Systems and Applications, 2:
[9] X.Z. Niu, K. She, "Study of fast parallel clustering partition algorithm for large datasets", Computer Science, 39.
[10] Z. Zhao, H.F. Ma, Z.Z. Shi, "Research on parallel K-means algorithm design based on Hadoop platform", Computer Science, 38.
[11] Dunren Che, Mejdl Safran, and Zhiyong Peng, "From Big Data to Big Data Mining: Challenges, Issues, and Opportunities", DASFAA Workshops 2013, LNCS 7827.
[12] M. Ester, H.P. Kriegel, J. Sander, X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press.
[13]
[14] I.S. Dhillon, Y. Guan, and B. Kulis, "Weighted graph cuts without eigenvectors: a multilevel approach", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29.
[15] Hitesh Ninama, "Distributed data mining using Message Passing Interface", Review of Research, vol. 2, issue 9.
[16] Ellis H. Wilson III, Mahmut T. Kandemir, Garth Gibson, "Exploring Big Data Computation atop Traditional HPC NAS Storage", International Conference on Distributed Computing Systems, 2014.
[17] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters", Communications of the
ACM, 51(1), 2008.
[18]
[19]
[20] Jian-Hua Zheng, Liang-Jie Zhang, Rong Zhu, Ke Ning and Dong Liu, "Parallel Matrix Multiplication Algorithm Based on Vector Linear Combination Using MapReduce", IEEE Ninth World Congress on Services, IEEE, 2013.
[21] Weizhong Zhao, Huifang Ma, and Qing He, "Parallel K-Means Clustering Based on MapReduce", CloudCom, LNCS 5931, Springer-Verlag Berlin Heidelberg, 2009.
[22] Qing Liao, Fan Yang, Jingming Zhao, "An Improved Parallel K-means Clustering Algorithm with MapReduce", Proceedings of ICCT 2013, IEEE, 2013.
[23] Xiaoli Cui, Pingfei Zhu, Xin Yang, Keqiu Li, Changqing Ji, "Optimized big data K-means clustering using MapReduce", The Journal of Supercomputing, Springer Science+Business Media New York.
[24]
[25] Yuan Luo and Beth Plale, "Hierarchical MapReduce Programming Model and Scheduling Algorithms", 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2012.
[26] Michael Steinbach, Levent Ertöz, and Vipin Kumar, "The Challenges of Clustering High Dimensional Data", New Directions in Statistical Physics, Part IV, 2004.
[27] "Where to Deploy Your Hadoop Clusters?", Accenture, June.
[28] Y. Ma, P. Zhang, Y. Cao and L. Guo, "Parallel Auto-encoder for Efficient Outlier Detection", in Proceedings, 2013.
Hadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
A Survey on Outlier Detection Techniques for Credit Card Fraud Detection
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo Dipartimento di Automatica e Informatica, Politecnico di Torino,
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Scalable Multiple NameNodes Hadoop Cloud Storage System
Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai
Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster
Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster Sudhakar Singh Dept. of Computer Science Faculty of Science Banaras Hindu University Rakhi Garg Dept. of Computer
Design of Electric Energy Acquisition System on Hadoop
, pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University
Grid Density Clustering Algorithm
Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2
Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique
A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique Jyoti Malhotra 1,Priya Ghyare 2 Associate Professor, Dept. of Information Technology, MIT College of
From GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
Efficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
Searching frequent itemsets by clustering data
Towards a parallel approach using MapReduce Maria Malek Hubert Kadima LARIS-EISTI Ave du Parc, 95011 Cergy-Pontoise, FRANCE [email protected], [email protected] 1 Introduction and Related Work
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
An efficient Mapreduce scheduling algorithm in hadoop R.Thangaselvi 1, S.Ananthbabu 2, R.Aruna 3
An efficient Mapreduce scheduling algorithm in hadoop R.Thangaselvi 1, S.Ananthbabu 2, R.Aruna 3 1 M.E: Department of Computer Science, VV College of Engineering, Tirunelveli, India 2 Assistant Professor,
Storage and Retrieval of Data for Smart City using Hadoop
Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Robust Outlier Detection Technique in Data Mining: A Univariate Approach
Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
A computational model for MapReduce job flow
A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125
Evaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang [email protected] University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
A Survey of Classification Techniques in the Area of Big Data.
A Survey of Classification Techniques in the Area of Big Data. 1PrafulKoturwar, 2 SheetalGirase, 3 Debajyoti Mukhopadhyay 1Reseach Scholar, Department of Information Technology 2Assistance Professor,Department
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Optimization of Distributed Crawler under Hadoop
MATEC Web of Conferences 22, 0202 9 ( 2015) DOI: 10.1051/ matecconf/ 2015220202 9 C Owned by the authors, published by EDP Sciences, 2015 Optimization of Distributed Crawler under Hadoop Xiaochen Zhang*
Optimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
Survey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
Constant time median filtering of extra large images using Hadoop
Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 93 101 doi: 10.14794/ICAI.9.2014.1.93 Constant time median filtering of extra
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Big Application Execution on Cloud using Hadoop Distributed File System
Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------
processed parallely over the cluster nodes. Mapreduce thus provides a distributed approach to solve complex and lengthy problems
Big Data Clustering Using Genetic Algorithm On Hadoop Mapreduce Nivranshu Hans, Sana Mahajan, SN Omkar Abstract: Cluster analysis is used to classify similar objects under same group. It is one of the
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
