A Two-stage Algorithm for Outlier Detection Under Cloud Environment Based on Clustering and Perceptron Model


Journal of Computational Information Systems 11: 20 (2015) Available at http://www.jofcis.com

A Two-stage Algorithm for Outlier Detection Under Cloud Environment Based on Clustering and Perceptron Model

Bing HU 1,*, Fuxi ZHU 1, Lu LU 2, Rong GAO 1

1 School of Computer, Wuhan University, Wuhan, China
2 College of Computer Science and Technology, Shanghai University of Electric Power, Shanghai, China

Abstract

Outlier detection in large high-dimensional datasets is an important and difficult branch of outlier detection research. Because large high-dimensional datasets are characterized by sparse data distributions and many attributes, traditional distance-based outlier detection methods cease to be effective on massive data. To address this problem, we propose MR-ODCDC, a two-stage outlier detection algorithm based on both clustering and a perceptron model. MR-ODCDC first obtains several micro-clusters by partitioning the dataset with the K-means clustering algorithm, then mines local outliers from the objects in these micro-clusters with a method that combines the perceptron model and a distance-based outlier detection algorithm, and finally achieves parallelization and scalability through the MapReduce programming model. Experimental results show that our algorithm improves the efficiency of outlier detection, and the larger the dataset, the more pronounced the improvement.

Keywords: Outlier; MapReduce; Data Mining; Cell; Massive Data; Clustering Analysis

1 Introduction

Outlier detection [1-3], also known as outlier mining, aims at selecting from a large and complex dataset the small portion of abnormal data that is novel and different from the conventional data.

* Corresponding author. Email address: hubing @126.com (Fuxi ZHU)

Copyright 2015 Binary Information Press. DOI: /jcis15944. October 15, 2015

In practical applications, outliers can be classified into two categories. 1) One is brought about by human errors or measuring equipment failure, such as execution errors or incorrect measurements. This kind of outlier is useless and should be eliminated. 2) The other kind is caused by the data itself. These outliers are very interesting and usually carry useful information; they are the concern of outlier detection research. Outlier detection has found wide application in fields such as credit card fraud, loan application, network intrusion detection and customer classification, to name a few. As an important branch of data mining, outlier detection has attracted a great deal of attention from researchers in recent years, and some promising results have been achieved [4].

There are numerous outlier detection algorithms. They can generally be divided into five categories: 1) distribution-based algorithms [5]; 2) depth-based algorithms [6]; 3) density-based algorithms [7]; 4) distance-based algorithms [8]; and 5) clustering-based algorithms [9]. In [8], the authors offer a general distance-based definition of outliers and propose a detection method based on the K-distance neighborhood that can decide whether an object is an outlier without knowing the data distribution. However, the time complexity of this algorithm is rather high, the mining result is very sensitive to the choice of parameters, and the accuracy and timeliness of the detection decline as the size and dimensionality of the dataset increase. The study in [9] focuses on the discovery of clusters: it first clusters the dataset, and the data points left over after clustering are labeled as outliers. This method is simple but inefficient.

From the above analysis we can see that, in spite of the many existing outlier detection algorithms, none of them can effectively deal with massive data. With the rapid development of parallel computing technologies, a new distributed outlier detection algorithm that runs in a cloud environment and combines cloud computing technology [10, 11] with outlier detection is urgently needed to improve detection efficiency on massive data.

In this paper, we present an outlier detection algorithm based on the Hadoop cloud platform. Our algorithm combines clustering analysis with the cell-based approach of the distance-based outlier detection method [12], and divides the whole algorithm into two subtasks using the sequentially combined MapReduce parallel programming model put forward by Google. During detection, the MapReduce programming model is used to realize the subtasks one by one. The feasibility and availability of the proposed algorithm have been verified on standard UCI datasets. The experiments show that our algorithm works not only correctly but also effectively when dealing with massive data.

2 Preliminaries

2.1 Hadoop cloud platform

The Hadoop cloud platform mainly consists of two parts: the Hadoop Distributed File System (HDFS) and the MapReduce programming model framework. HDFS stores files across all the storage nodes in a Hadoop cluster, and it has three key properties, namely high fault tolerance, high reliability and high throughput of data access, which make it appropriate for dealing with large datasets.
The MapReduce programming model is a simplified distributed model originally designed and introduced by Google. It is usually applied to the computation of massive data in a distributed parallel environment.

A MapReduce job is mainly composed of a Map operation and a Reduce operation. First, the input files are divided into several chunks. The Master server then distributes different chunks to the corresponding Worker servers for the Map operation. All Map operations are independent of each other and highly parallel: they read data in the form of <Key, Value> pairs and write the intermediate results to the local file system in the same form. The Master server then assigns other Worker servers to the Reduce operation, which combines the intermediate results written by the Map operation and outputs them in the form of <Key, Value> pairs.

Fig. 1: MapReduce model (input files in HDFS are split, processed in the Map phase, combined, reduced, and written back to HDFS as output files)

2.2 Perceptron

A single-layer perceptron is a feed-forward network with a single layer of neurons that uses a threshold as its activation function. By training the network weights, we can make the perceptron's response to a group of input vectors equal the target output, whose value is 0 or 1, and thereby classify the input vectors. Fig. 2 shows the model diagram of the single-layer perceptron neuron.

Fig. 2: Model diagram of the single-layer perceptron neuron (inputs p_1, ..., p_r with weights w_1, ..., w_r)

As shown in Fig. 2, the weighted sum of the input components p_i (i = 1, 2, ..., r), obtained with the weight components w_i (i = 1, 2, ..., r), is taken as the input of the threshold function. The deviation (bias) b adds a new adjustable parameter to the network and makes it much easier for the network input to reach the desired target vector.

Perceptron learning rule. The learning rule is an algorithm for calculating the new weight matrix W and the new deviation B. The perceptron adopts it to adjust the network weights so that the network's response to the input vector becomes the target output, whose value is 0 or 1. For a perceptron network with input vector P, output vector A, and target vector T, the parameters are adjusted according to the possible cases of the output:

1. If the neuron output is correct, that is, a = t, then the connective weights and the deviation b remain unchanged.

2. If the neuron output is 0 while the desired output is 1, that is, a = 0 and t = 1, then the new weight w_i equals the former weight w_i plus the input component p_i.

3. If the neuron output is 1 while the desired output is 0, that is, a = 1 and t = 0, then the new weight w_i equals the former weight w_i minus the input component p_i.

In this paper, the deviation b is the weighted sum of the preset values a_i of the number of objects in the target cell and the neighboring cells in our outlier mining experiment. It varies with the weights, namely:

b = Σ_{i=1}^{r} a_i w_i

According to the above analysis, the essence of the perceptron learning rule is that the weight variation equals either the positive or the negative input vector. For all i, where i = 1, 2, ..., r, the weight update formula of the perceptron is:

Δw_i = (t − a) p_i    (1)

In vector-matrix form it can be written as:

W = W + E P^T    (2)

where in Eq. (2), E is the error vector,

E = T − A    (3)
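The update rule above can be written as a short routine. The following Python sketch only illustrates Eqs. (1)-(3) for a single-layer perceptron with a hard-limit activation; the function and variable names are our own, and the paper does not prescribe this implementation.

```python
import numpy as np

def hardlim(x):
    """Hard-limit (threshold) activation: 1 if x >= 0, else 0."""
    return (x >= 0).astype(int)

def train_perceptron(P, T, epochs=100):
    """Train a single-layer perceptron.

    P: input matrix of shape (n_samples, r)
    T: target vector of 0/1 labels, shape (n_samples,)
    Returns the weight vector W (length r) and the bias b.
    """
    n_samples, r = P.shape
    W = np.zeros(r)
    b = 0.0
    for _ in range(epochs):
        for p, t in zip(P, T):
            a = hardlim(W @ p + b)   # forward pass through the threshold unit
            e = t - a                # error, Eq. (3): E = T - A
            W = W + e * p            # weight update, Eqs. (1)-(2)
            b = b + e                # bias adjusted in the same direction
    return W, b
```

Under these assumptions, calling train_perceptron on vectors of the three cell counts (p_1, p_2, p_3) described in Section 3 would yield the weights {w_1, w_2, w_3} that are later used to set the threshold M.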

2.3 K-means algorithm

K-means is one of the classic clustering algorithms. The method first randomly selects K objects as the initial means, or centers, of the clusters, then calculates the distance between the remaining data points and the centers and allocates each point to the cluster whose center is nearest; finally it computes the new mean of each cluster. The above process is repeated until the criterion function converges to a given termination threshold. Its pseudo-code is listed below:

Input: the number of clusters K, the dataset D which contains n objects
Output: the corresponding classification results of the K clusters

1. Randomly select K points from dataset D as the initial clustering centers;
2. Calculate the distance between each point and the means (namely, the clustering centers);
3. Allocate each point to the cluster with the minimum distance;
4. Update the cluster means, namely, recompute each cluster mean and take them as the new cluster centers;
5. Repeat Step (2) to Step (4) until the criterion function converges.

3 MR-ODCDC Algorithm

3.1 Related conceptions of MR-ODCDC

Definition 1 (Distance-based outlier): An object O in a given dataset D is a distance-based outlier with parameters p and d, denoted as a DB(p, d) outlier, if at least a fraction p of the objects in D lies at a distance greater than d from O.

Definition 2 (Euclidean distance): For a given dataset D whose data dimension is d, the Euclidean distance between two data objects O_i and O_j is defined as:

dist(O_i, O_j) = sqrt( Σ_{k=1}^{d} (O_ik − O_jk)² )    (4)

Definition 3: The first-layer neighbors of C_{x,y}, also called the immediately neighboring cells of C_{x,y}, denoted L_1(C_{x,y}):

L_1(C_{x,y}) = { C_{u,v} | u = x ± 1, v = y ± 1, C_{u,v} ≠ C_{x,y} }    (5)

Definition 4: The second-layer neighbors of C_{x,y}, which form a ring two cells thick, denoted L_2(C_{x,y}):

L_2(C_{x,y}) = { C_{u,v} | u = x ± 3, v = y ± 3, C_{u,v} ∉ L_1(C_{x,y}), C_{u,v} ≠ C_{x,y} }    (6)

The data objects in the following two definitions are 2-D. Suppose the data space is partitioned into cells whose side length is l = D/(2√2), Euclidean distance is adopted as the metric for measuring the distance between objects, and C_{x,y} denotes the cell at the intersection of row x and column y. The conceptions and properties of the first- and second-layer neighbors are then as follows; we can derive Definition 5 from Definition 3 and Definition 4.

Definition 5: If C_{u,v} ≠ C_{x,y} and C_{u,v} is neither a layer-1 nor a layer-2 neighbor of C_{x,y}, then for objects p ∈ C_{u,v} and q ∈ C_{x,y} the distance between p and q must exceed D; since the combined thickness of L_1(C_{x,y}) and L_2(C_{x,y}) is three cells, the distance between p and q exceeds 3l = 3D/(2√2) > D.

From Definition 5 we can deduce another definition:

Definition 6: (1) If the number of objects in C_{x,y} is more than M, that is, count > M, none of the objects in C_{x,y} are outliers; (2) if the sum of the numbers of objects in C_{x,y} and L_1(C_{x,y}) is more than M, that is, count > M, none of the objects in C_{x,y} are outliers; (3) if the sum of the numbers of objects in C_{x,y}, L_1(C_{x,y}) and L_2(C_{x,y}) is less than M, that is, count < M, all the objects in C_{x,y} are outliers.
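To make Definitions 3-5 concrete, the short Python sketch below enumerates the first- and second-layer neighbors of a cell and evaluates the lower bound 3l of Definition 5. The helper names are ours and only illustrate the cell geometry under the side length l = D/(2√2) stated above.

```python
import math

def l1_neighbors(x, y):
    """Definition 3: the 8 immediately neighboring cells of C_{x,y}."""
    return {(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)} - {(x, y)}

def l2_neighbors(x, y):
    """Definition 4: the ring of cells two cells thick around L1(C_{x,y})."""
    block = {(x + dx, y + dy) for dx in range(-3, 4) for dy in range(-3, 4)}
    return block - l1_neighbors(x, y) - {(x, y)}

def min_distance_outside(D):
    """Definition 5: points in cells beyond L2 are separated by more than
    three cell widths, i.e. more than 3l = 3D / (2*sqrt(2)) > D."""
    l = D / (2 * math.sqrt(2))
    return 3 * l

if __name__ == "__main__":
    print(len(l1_neighbors(0, 0)))     # 8 first-layer cells
    print(len(l2_neighbors(0, 0)))     # 40 second-layer cells
    print(min_distance_outside(1.0))   # about 1.06, indeed greater than D = 1.0
```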

Fig. 3: Cell structure chart (cell side length l = D/(2√2))

Definition 7: The criterion function of the K-means algorithm is:

E = Σ_{i=1}^{k} Σ_{p ∈ KC_i} |p − m_i|²

where E is the sum of squared errors over all objects in the dataset, p is a point in the space representing a given object, and m_i is the mean of cluster KC_i (both p and m_i are multi-dimensional). This criterion function increases the independence of the K generated micro-clusters.

3.2 The basic idea of the MR-ODCDC algorithm

The whole algorithm is divided into two subtasks by adopting a sequentially combined MapReduce programming model. Each subtask is completed with the MapReduce programming model in sequence, and the input of the second subtask is the output of the first.

The first subtask: we first cluster the dataset to be examined with the K-means algorithm, then label all points whose distance to the corresponding cluster center is equal to or larger than the cluster radius as candidate outliers. Note that if the number of data points in a cluster is less than or equal to the pre-set number of outliers, the whole cluster can be labeled directly as a candidate set of outliers without comparing against the radius.

The second subtask: we detect outliers among the newly generated candidates with the cell-based outlier detection algorithm. Compared with the traditional distance-based outlier detection algorithms [11], our MapReduce-based outlier detection algorithm adopts both a single-layer perceptron and MapReduce, so its performance is improved to some extent. The algorithm first trains the single-layer perceptron with a group of training data. Each training example contains three elements: the number of objects in the target cell, p_1; the number of objects in the layer-1 neighboring cells, p_2; and the number of objects in the layer-2 neighboring cells, p_3. The deviation b is the product sum of the pre-set threshold and the weights, that is, b = N (w_1/49 + w_2/49 + w_3/49). By training the perceptron, we obtain the optimal weight values and can then calculate the threshold.
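The first subtask described above can be sketched as follows. This is a minimal, single-machine illustration that assumes scikit-learn's KMeans in place of the parallel K-means of Section 3.3 and takes the cluster "radius" to be the mean distance of a cluster's points to its center; neither choice is fixed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_outliers(X, k, max_outliers):
    """First-subtask sketch: cluster with K-means and return indices of
    candidate outliers (points at or beyond the assumed cluster radius)."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    centers, labels = km.cluster_centers_, km.labels_
    candidates = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        dists = np.linalg.norm(X[idx] - centers[c], axis=1)
        if len(idx) <= max_outliers:
            # Small clusters become candidate outlier sets directly.
            candidates.extend(idx.tolist())
            continue
        radius = dists.mean()                       # assumed radius definition
        candidates.extend(idx[dists >= radius].tolist())
    return candidates
```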

MapReduce is adopted twice in the process of outlier mining. The first time, MapReduce is used to preprocess the original massive data: the Map operation extracts the valid data items from the data objects in S and maps each data object to the corresponding cell, then outputs the key values of the data objects, the valid data items and the mapping information. The second time, MapReduce is applied to outlier mining: the Map operation inputs the data blocks formed on the basis of the mapping information, mines outliers with the cell-based outlier detection algorithm into which the weights are introduced, and outputs the mining result; the Reduce operation integrates the results and outputs the collection of outliers.

3.3 Description of MR-ODCDC algorithm

The first subtask: the parallel K-means algorithm. The MapReduce programming model adopts the idea of the K-means algorithm. The Map operation randomly selects k points as the initial clustering objects, each of which is regarded as an initial clustering mean, or center. It then calculates the distance between the remaining data points and those centers and allocates each point to the cluster whose center is nearest. The Reduce operation calculates the mean of each cluster. The above process is repeated until the criterion function converges, that is, until consecutive clustering centers undergo no change. The pseudo-code is as follows:

Input: the number of clusters k, and the dataset D which contains n objects.
Output: the corresponding collections of the k clusters.

1. Randomly select k points from dataset D as the initial central points and denote them as K.
2. While the criterion function changes:
3. Compare the distances between the data points and the central points with the Map function, and output each data point together with the number of the central point nearest to it.
4. Integrate the data points under the same central point with the Reduce function, calculate the new means, and output the results as the new central points.
5. End while.
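The map/reduce structure of this first subtask can be sketched with plain Python functions standing in for Hadoop's Map and Reduce tasks; the function names, the convergence test and the in-memory data handling are our own simplifications of the pseudo-code above, not the paper's implementation.

```python
import numpy as np

def kmeans_map(points, centers):
    """Map: emit (index of nearest center, point) for each data point (step 3)."""
    for p in points:
        nearest = int(np.argmin([np.linalg.norm(p - c) for c in centers]))
        yield nearest, p

def kmeans_reduce(pairs, k, old_centers):
    """Reduce: group points by center index and recompute each cluster mean (step 4)."""
    groups = {i: [] for i in range(k)}
    for i, p in pairs:
        groups[i].append(p)
    return np.array([np.mean(groups[i], axis=0) if groups[i] else old_centers[i]
                     for i in range(k)])

def parallel_kmeans(points, k, tol=1e-6, max_iter=100):
    """Iterate Map and Reduce until the centers stop changing (steps 2 and 5).

    points: numpy array of shape (n, d).
    """
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), size=k, replace=False)]   # step 1
    for _ in range(max_iter):
        new_centers = kmeans_reduce(kmeans_map(points, centers), k, centers)
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers
```

In an actual Hadoop job each call to kmeans_map would run on a different Worker over one data block, and the shuffle phase would deliver the emitted (center index, point) pairs to the Reducers.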

The second subtask: the cell-based outlier detection parallel algorithm. The improved MapReduce-based outlier detection algorithm adopts a single-layer perceptron together with the MapReduce programming model and can be further divided into two stages.

The first stage: the Map operation obtains the valid data objects in S, maps each data object to the corresponding cell, and then outputs the key values of the data objects, the valid data items and the mapping information. The Reduce operation integrates the preprocessing results on the basis of the mapping information and outputs the integrated result. In this stage, MapReduce deals with the original massive data. The algorithm first trains the single-layer perceptron with a group of training data to obtain the optimal weight values, and then calculates the threshold from them. Each training example consists of three elements: the number of objects in the target cell, p_1; the number of objects in the layer-1 neighboring cells, p_2; and the number of objects in the layer-2 neighboring cells, p_3. The deviation b is the product sum of the pre-set threshold and the weights, that is, b = N (w_1/49 + w_2/49 + w_3/49).

The second stage: the Map operation inputs the data blocks formed in accordance with the mapping information, mines outliers with the cell-based outlier detection algorithm into which the weight values are introduced, and outputs the mining results. The Reduce operation integrates the mining results and outputs the collection of outliers. In this stage, MapReduce deals with the outlier mining itself. The pseudo-code is listed below:

Input: large dataset S, radius of the neighborhood d_min, threshold of the probability base η.
Output: collection of abnormal objects.

(1) Prepare the training data, train the neural network, obtain the weight values {w_1, w_2, w_3}, and calculate the new threshold M = N (w_1/49 + w_2/49 + w_3/49), where N is the number of data objects in S.

(2) Randomly divide the large dataset S into several data blocks {S_11, S_12, ..., S_1k}. Preprocess each data block S_1i with a Map operation, mapping each data object to the corresponding cell to obtain S'_1i. Collect the preprocessed data blocks S'_1i with a Reduce operation, integrate the data objects in accordance with the mapping information, and obtain the large dataset S_2.

(3) Divide the large dataset S_2 into several data blocks {S_21, S_22, ..., S_2k} in light of the mapping information, and mine the outliers with a Map operation:

for i = 1 to n  // scan each cell C_i
    if count_i · w_1 > M
        label C_i and its first-layer neighboring cells as true (namely, none of their points are outliers)
end for
for each unlabeled cell C_i do:
    count_i2 = count_i · w_1 + w_2 · Σ_{C_j ∈ L_1(C_i)} count_j
    if count_i2 > M, label this cell as true
    else
        count_i3 = count_i2 + w_3 · Σ_{C_k ∈ L_2(C_i)} count_k
        if count_i3 ≤ M
            label this cell as false (all of its points are outliers)
        else
            for each point O in C_i
                count_O = count_i2
                scan each point P in L_2(C_i):
                    if dist(P, O) ≤ D: count_O = count_O + w_3
                if count_O > M, label point O as true
                else label point O as false (namely, O is an outlier)
            end for
        end if
    end if
end for
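The pseudo-code above can be rendered as a compact single-machine Python sketch. The weights {w_1, w_2, w_3} and the threshold M are assumed to come from the earlier training step, the helper names are ours, and the handling of undecided cells follows our reading of the pseudo-code rather than a published implementation.

```python
import math
from collections import Counter, defaultdict

def weighted_cell_outliers(points, D, M, w1, w2, w3):
    """Stage-two sketch: weighted cell-based outlier mining over 2-D points.

    points: list of 2-D tuples (e.g., the candidates from the first subtask)
    D, M:   distance parameter and object-count threshold
    w1-w3:  weights obtained by training the single-layer perceptron
    Returns the list of points labeled as outliers.
    """
    l = D / (2 * math.sqrt(2))                        # cell side length
    cell_of = lambda p: (int(p[0] // l), int(p[1] // l))
    members = defaultdict(list)
    for p in points:
        members[cell_of(p)].append(p)
    counts = Counter({c: len(ps) for c, ps in members.items()})

    def ring(cell, r_in, r_out):
        """Cells whose Chebyshev distance to `cell` lies in [r_in, r_out]."""
        x, y = cell
        return [(x + dx, y + dy)
                for dx in range(-r_out, r_out + 1)
                for dy in range(-r_out, r_out + 1)
                if r_in <= max(abs(dx), abs(dy)) <= r_out]

    outliers = []
    for cell, ps in members.items():
        if counts[cell] * w1 > M:                     # rule using the cell count alone
            continue
        c1 = counts[cell] * w1 + w2 * sum(counts.get(n, 0) for n in ring(cell, 1, 1))
        if c1 > M:                                    # cell plus L1 already dense enough
            continue
        c2 = c1 + w3 * sum(counts.get(n, 0) for n in ring(cell, 2, 3))
        if c2 <= M:                                   # even L2 cannot reach M: all outliers
            outliers.extend(ps)
            continue
        # Undecided cell: check each object against the individual points of its L2 ring.
        l2_points = [q for n in ring(cell, 2, 3) for q in members.get(n, [])]
        for o in ps:
            score = c1
            for q in l2_points:
                if math.dist(o, q) <= D:
                    score += w3
            if score <= M:
                outliers.append(o)                    # O is an outlier
    return outliers
```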

The output of the second Map operation is the collection of outliers, i.e., <key, value> results of the form <dot mark, information of the dot object>, which also serves as the input of the second Reduce operation. The Reduce operation carries out global statistics over all received point information and outputs the outliers as <key, value> pairs as the final result. The algorithm finishes when the Reduce operation completes.

In this algorithm, the first Map operation scans all the data objects, while the second Map operation first scans and labels all the cells and then scans the objects in the unlabeled or undecided cells. Thus the time complexity of the algorithm is O(c^k + n), where k is the dimensionality of the dataset and c is a constant based on the number of cells. In our paper, k = 2, c is the number of cells in the data blocks input to each single Map operation, and n is the number of objects handled in a single Map operation. Therefore both c and n are much smaller than in the traditional algorithms, and the time complexity of our method is correspondingly reduced.

4 Analysis of Experiments and Results

4.1 Efficiency analysis

To verify the effectiveness of our algorithm, we select 4 groups of hybrid datasets from the UCI repository [13] for experimental analysis and preprocess them with the method introduced in [14]. To make analysis and comparison more convenient, we first divide the original data into several classes according to a certain norm, then delete part of the objects from one or several classes to turn the original datasets into imbalanced ones. The processed data information is shown in Table 1.

Table 1: Datasets description
dataset          Number of samples    Number of attributes    category
Breast Cancer
Iris
Page blocks

We evaluate the outlier accuracy by comparing the MR-ODCDC algorithm with the MR-ODMD algorithm [15] and the MR-AVF algorithm [16]. Outlier accuracy, also known as outlier coverage, which is introduced in [14], is a very important metric for judging the validity of an algorithm.

The ratio of the true isolated points among the isolated points detected by an outlier detection algorithm to the original isolated points in the dataset is an important metric for evaluating the performance of the algorithm; a higher ratio indicates better performance. It is defined as:

P = N / K

where P is the accuracy, N is the number of outliers detected by the algorithm, and K is the number of outliers in the dataset.

Table 2: Iris (6 outliers)
algorithm     The number of outliers K     The number of outliers detected by algorithm N
MR-ODMD
MR-AVF
MR-ODCDC

Table 3: Breast cancer (39 outliers)
algorithm     The number of outliers K     The number of outliers detected by algorithm N
MR-ODMD
MR-AVF
MR-ODCDC

Table 4: Page blocks (280 outliers)
algorithm     The number of outliers K     The number of outliers detected by algorithm N
MR-ODMD
MR-AVF
MR-ODCDC

From the above three tables we can see that our MR-ODCDC algorithm is of good precision. For two-class outlier detection, taking the Breast Cancer dataset as an example, the performance of MR-ODCDC is slightly better than that of the other two algorithms.
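As a worked illustration of the accuracy metric defined above, the snippet below computes P for one dataset given the IDs an algorithm reports and the ground-truth outlier IDs; counting only true detections follows the "ratio of true isolated points" reading above, and the sample IDs are hypothetical.

```python
def outlier_accuracy(detected_ids, true_outlier_ids):
    """P = N / K, where N counts detected points that are truly outliers
    and K is the number of outliers in the dataset."""
    true_set = set(true_outlier_ids)
    n_true_detected = len(set(detected_ids) & true_set)
    return n_true_detected / len(true_set)

# Hypothetical example: 6 ground-truth outliers (as in the Iris setting),
# of which an algorithm finds 5 plus one false alarm, giving P = 5/6.
print(outlier_accuracy(detected_ids=[3, 7, 12, 20, 41, 99],
                       true_outlier_ids=[3, 7, 12, 20, 41, 58]))
```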

For normal multi-class outlier detection, such as outlier detection in the Iris and Page blocks datasets, our MR-ODCDC algorithm is much more effective. This is because, by clustering the data objects with the K-means algorithm, our algorithm eliminates part of the normal data in the original dataset and reduces the number of objects to be examined, while many more outliers are discovered; the precision of the final result is therefore greatly improved.

4.2 Analysis of parallel experiments and results

In our experiments, the Hadoop cluster platform is set up on Ubuntu with several PCs and JDK version 1.7. Each node is a PC with an Intel Celeron(R) E3500 CPU at 2.70 GHz, 2 GB of RAM and a gigabit network card. One PC serves as the master node, denoted Master (namely, the NameNode), which is responsible for scheduling; the others serve as worker nodes, denoted Slave (namely, DataNodes), which carry out the computation. In the Hadoop cluster configuration, the maximum numbers of Mapper and Reducer tasks per node (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum) are set to 2, and the total HDFS capacity of the whole cluster is 1.2 TB.

To show that our algorithm performs better on massive data, the experiment adopts the KDD CUP 1999 dataset [17] as its experimental data. We modify the dataset with the software of [18] and obtain 5 datasets whose sizes are 1k, 5k, 10k, 50k and 100k, respectively. With a cluster of 4 PCs, the computing times of these datasets are 2.142 s, 4.356 s, 7.019 s, ..., respectively. As shown in Fig. 4, when the size of the dataset is under 5k, there is little difference between cluster processing and uniprocessing, and the former even takes longer than the latter because the base overhead of MapReduce is rather large, most of it spent on data splitting, generating the sort order of the intermediate files, and communication between the nodes. The performance of the MapReduce cluster improves as the size of the dataset grows from 10k to 100k; its run time is 33.54% better than that of uniprocessing.

Fig. 4: Comparison of run time between cluster and monoprocessor (run time in seconds vs. data size in KB)

To better demonstrate the advantage of cluster processing, we use 1 PC, 2 PCs, 4 PCs and 8 PCs to detect the outliers in the datasets of size 1k, 5k, 10k, 50k and 100k, respectively.

The computing times are illustrated in Fig. 5. As shown in Fig. 5, when the dataset is rather small, the computing time of the MapReduce-based algorithm is slightly higher than that of uniprocessing because of the base overhead of the MapReduce operation, and up to about 10k the speed of uniprocessing remains higher than that of cluster processing. However, the larger the number of DataNodes, the stronger the computing power of our algorithm becomes. When the size of the dataset reaches 100k, the efficiency of cluster processing improves by 54.13% compared with uniprocessing, and the cluster of 8 PCs has the largest speed-up ratio and thus the shortest run time.

Fig. 5: Comparison diagram of run time for 1 PC, 2 PCs, 4 PCs and 8 PCs (run time in seconds vs. data size in KB)

5 Conclusion

Outlier detection is a branch of great importance in the field of data mining and has already found wide application in many areas. Traditional distance-based outlier detection algorithms are memory-resident methods, which are less than ideal for massive data. In this paper, we propose a two-stage outlier detection algorithm that is based on the MapReduce programming model and adopts the K-means algorithm for clustering. Our algorithm first divides the dataset into several micro-cluster components with the K-means algorithm, then trains a single-layer perceptron with training data drawn from the micro-cluster components to obtain the optimal weights, and finally improves its efficiency further by realizing parallelization with the MapReduce programming model. We have experimentally demonstrated that our algorithm has advantages over existing distributed outlier detection algorithms in efficiency and accuracy when dealing with large high-dimensional datasets. To further extend the applicability of outlier detection techniques, in the future we will try to tackle several other challenging problems in this field, such as outlier detection in uncertain data and in complex data objects, including spatial data, spatio-temporal data and multimedia data.

References

[1] A. R. Xue, S. G. Ju, W. H. He, W. H. Chen, Research on Local Outliers Mining Algorithm, Chinese Journal of Computers, vol. 30, no. 8.
[2] J. W. Han, M. Kamber, Data Mining: Concepts and Techniques (2nd edition), Morgan Kaufmann Publishers, San Francisco.
[3] P. N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, New York.
[4] C. P. Hu, X. L. Qin, A Density-based Local Outlier Detecting Algorithm, Journal of Computer Research and Development, vol. 47, no. 12.
[5] V. Barnett, T. Lewis, Outliers in Statistical Data, John Wiley and Sons, New York.
[6] T. Johnson, I. Kwok, R. Ng, Fast Computation of 2-dimensional Depth Contours, ACM International Conference on Knowledge Discovery and Data Mining, New York.
[7] M. M. Breunig, H. P. Kriegel, R. Ng, LOF: Identifying Density-based Local Outliers, ACM International Conference on Knowledge Discovery and Data Mining, New York.
[8] E. M. Knorr, R. Ng, Algorithms for Mining Distance-based Outliers in Large Datasets, Proceedings of the 24th International Conference on Very Large Data Bases, New York, 1998.
[9] A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, vol. 31, no. 3.
[10] K. Chen, W. M. Zheng, Cloud Computing: System Instances and Current Research, Journal of Software, vol. 50, no. 2.
[11] P. Mell, T. Grance, The NIST Definition of Cloud Computing, National Institute of Standards and Technology, Gaithersburg.
[12] J. Zhang, Z. H. Sun, M. Yang, Fast Incremental Outlier Mining Algorithm Based on Grid and Capacity, Journal of Computer Research and Development, vol. 45, no. 5.
[13] C. Blake, C. Merz, UCI Machine Learning Repository.
[14] C. C. Aggarwal, P. S. Yu, Outlier Detection for High Dimensional Data, Proceedings of the ACM SIGMOD International Conference on Management of Data.
[15] Y. P. Guo, J. Y. Liang, X. W. Zhao, An Outlier Detection Algorithm for Mixed Data Based on MapReduce, Journal of Chinese Computer Systems, vol. 35, no. 9.
[16] A. Koufakou, J. Secretan, J. Reeder, Fast Parallel Outlier Detection for Categorical Datasets Using MapReduce, International Joint Conference on Neural Networks.
[17] W. W. Ni, G. Chen, J. P. Lu, Y. J. Wu, Z. H. Sun, Weighted Subspace Outlier Detection Algorithm Based on Local Information Entropy, Computer Research and Development, vol. 8, no. 2.
[18] D. Cristofor, D. Simovici, Finding Median Partitions Using Information-Theoretical Genetic Algorithms, Journal of Universal Computer Science, vol. 8, no. 2, 2002.


Applied research on data mining platform for weather forecast based on cloud storage Applied research on data mining platform for weather forecast based on cloud storage Haiyan Song¹, Leixiao Li 2* and Yuhong Fan 3* 1 Department of Software Engineering t, Inner Mongolia Electronic Information

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Data Security Strategy Based on Artificial Immune Algorithm for Cloud Computing

Data Security Strategy Based on Artificial Immune Algorithm for Cloud Computing Appl. Math. Inf. Sci. 7, No. 1L, 149-153 (2013) 149 Applied Mathematics & Information Sciences An International Journal Data Security Strategy Based on Artificial Immune Algorithm for Cloud Computing Chen

More information

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

More information

A Stock Pattern Recognition Algorithm Based on Neural Networks

A Stock Pattern Recognition Algorithm Based on Neural Networks A Stock Pattern Recognition Algorithm Based on Neural Networks Xinyu Guo guoxinyu@icst.pku.edu.cn Xun Liang liangxun@icst.pku.edu.cn Xiang Li lixiang@icst.pku.edu.cn Abstract pattern respectively. Recent

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK 1 K.RANJITH SINGH 1 Dept. of Computer Science, Periyar University, TamilNadu, India 2 T.HEMA 2 Dept. of Computer Science, Periyar University,

More information

Chapter 3: Cluster Analysis

Chapter 3: Cluster Analysis Chapter 3: Cluster Analysis 3.1 Basic Concepts of Clustering 3.2 Partitioning Methods 3.3 Hierarchical Methods 3.4 Density-Based Methods 3.5 Model-Based Methods 3.6 Clustering High-Dimensional Data 3.7

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network , pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Open Access Research on Database Massive Data Processing and Mining Method based on Hadoop Cloud Platform

Open Access Research on Database Massive Data Processing and Mining Method based on Hadoop Cloud Platform Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1463-1467 1463 Open Access Research on Database Massive Data Processing and Mining Method

More information

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients

More information

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo 71251911@mackenzie.br,nizam.omar@mackenzie.br

More information

A Comparative Study of clustering algorithms Using weka tools

A Comparative Study of clustering algorithms Using weka tools A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering

More information

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large

More information

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL *Hung-Ming Chen, Chuan-Chien Hou, and Tsung-Hsi Lin Department of Construction Engineering National Taiwan University

More information

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk. Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

More information

A Study of Web Log Analysis Using Clustering Techniques

A Study of Web Log Analysis Using Clustering Techniques A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information