# Research Article Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach

Save this PDF as:

Size: px
Start display at page:

Download "Research Article Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach"

## Transcription

4 4 Mathematical Problems in Engineering Parents Offspring New population HUX HUX Incest prevention HUX Restart population Yes d=0? No Select best If no offspring is generated d d 1 Figure 2: Flowchart of the CHC algorithm. problem, as each feature can be represented as a bit in the solution vector. Thus, each position of the vector indicates if the corresponding feature is selected or not. Therefore, this approach falls within the wrapper method category, according to the classification established in Section 2.1. The fitness function used to evaluate new individuals applies a k- Nearest Neighbors classifier (k-nn) [33]overthedatasetthat wouldbeobtainedafterremovingthecorrespondingfeatures. The fitness value is the weighted sum of the k-nn accuracy and the feature reduction rate MR-EFS Algorithm. This section describes the parallelization of the CHC algorithm, by using a MapReduce procedure to obtain a vector of weights. Let T be a training set, stored in HFDS and randomized as described in [34]. Let m be the number of map tasks. The splitting procedure of MapReduce divides T in m disjoint subsets of instances. Then, each T i subset (i {1, 2,...,m}) is processed by the corresponding Map i task. As this partitioning is performed sequentially, all subsets will have approximately the same number of instances, and the randomization of the T fileensuresanadequatebalanceof the classes. The map phase over each T i consists of the EFS algorithm (inthiscase,basedonchc)asdescribedinsection 3.1. Therefore, the output of each map task is a binary vector f i = {f i1,...,f id },whered isthenumberoffeatures,that indicates which features were selected by the CHC algorithm. The reduce phase averages all the binary vectors, obtaining avectorx as defined in (3), wherex j is the proportion of EFS applications that include the feature j in their result. ThisvectoristheresultoftheoverallEFSprocessandis usedtobuildthereduceddatasetthatwillbeusedforfurther machine learning purposes: x ={x 1,...,x D }, x j = 1 m m i=1 f ij, j {1, 2,...,D}. (3) Original training dataset Initial Map Reduce Final. Mappers training set CHC CHC. CHC Figure 3: Flowchart of MR-EFS algorithm. In the implementation used for the experiments, the reduce phase is carried out by a single task, which reduces the runtime by decreasing the MapReduce overhead [35]. The whole MapReduce process for EFS is depicted in Figure 3. It is noteworthy that the whole procedure is performed within a single iteration of the MapReduce workflow, avoiding additional disk accesses Dataset Reduction with MapReduce. Once vector x is calculated, the objective is to remove the less promising features from the original dataset. To do so in a scalable manner, an additional MapReduce process was designed. First, vectorx is binarized using a threshold θ: b ={b 1,...,b D }, b j = { 1, if x j θ, { 0, otherwise. { Vector b indicates which features will be selected for the reduced dataset. The number of selected features (D = D j=1 b j)canbecontrolledwithθ:withahighthreshold,only a few features will be selected, while a lower threshold allows more features to be picked. (4)

5 Mathematical Problems in Engineering 5 Initial Map Final Original training/test dataset Reduced dataset Mappers training/test set Result of EPS Threshold 0110 Figure 4: Flowchart of the dataset reduction process. The MapReduce process for the dataset reduction works as follows. Each map processes an instance and generates a new one that only contains the features selected in b. Finally, the instances generated are concatenated to form the final reduced dataset, without the need of a reduce step. The dataset reduction process, using the result of MR-EFS and an arbitrary threshold, is depicted in Figure Experimental Framework and Analysis This section describes the performed experiments and their results. First, Section 4.1 describes the datasets and the methods used for the experiments. Section 4.2 details the underlying hardware and software support. Finally, Sections 4.3 and 4.4 present the results obtained using two different datasets Datasets and Methods. This experimental study uses two large binary classification datasets in order to analyze the quality of the solutions provided by the MR-EFS algorithm. First, the epsilon dataset was used, which is composed of instances with 2000 numerical features. This dataset was artificially created for the Pascal Large Scale Learning Challenge [36]in2008.TheversionprovidedbyLIBSVM[37] was used. Additionally, this study includes the dataset used at the data mining competition of the Evolutionary Computation for Big Data and Big Learning held on July 14, 2014, in Vancouver (Canada), under the international conference GECCO-2014 (from now on, it is referred to as ECBDL14) [38]. This dataset has 631 features (including both numerical and categorical attributes), and it is composed of approximately 32 million instances. Moreover, the class distribution is not balanced: 98% of the instances belong to the negative class. In order to deal with the imbalance problem, the MapReduce approach of the Random Oversampling (ROS) algorithm presented in [39] was applied over the original training set for ECBDL14. The aim of ROS is to replicate the minority class instances from the original dataset until the number of instances from both classes is the same. Despite the inconvenience of increasing the size of the dataset, this technique was proven in [39] to yield better performance than other common approaches to deal with imbalance problems, such as undersampling and costsensitive methods. These two approaches suffer from the small sample size problem for the minority class when they are used within a MapReduce model. The main characteristics of these datasets are summarized intable 1. Foreachdataset, thenumberofinstancesforboth training and test sets and the number of attributes are shown, along with the number of splits in which MR-EFS divided each dataset. Note that the imbalanced version of ECBDL14 is not used in the experiments, as only the balanced ECBDL14- ROS version is considered. The parameters for the CHC algorithm are presented in Table 2. After applying MR-EFS over the described datasets, the behavior of the obtained reduced datasets was tested using three different classifiers implemented in Spark, available in MLlib: SVM [40], Logistic Regression [41], and Naive

7 Mathematical Problems in Engineering 7 Table 3: Execution times (in seconds) over the epsilon subsets. Instances Sequential CHC MR-EFS Splits Time (s) Feature Selection Performance. The complexity order of the CHC algorithm is approximately O(n 2 Dp), wherep is thenumberofevaluationsofthefitnessfunction(ak-nn classifier in this case), n is the number of instances, and D is the number of features. Therefore, the algorithm is quadratic with respect to the number of instances. When the dataset is divided into m splits within MR- EFS, each one of the m map tasks has complexity order O((n 2 /m 2 )Dp),whichism 2 times faster than applying CHC over the whole dataset. If n c cores are available for the map tasks, the complexity of the map phase within the MR- EFS procedure is approximately O( m/n c (n 2 /m 2 )Dp). This demonstrates the scalability of the approach presented in this paper: even if the maps are executed sequentially (n c = 1), the procedureisstilloneorderofmagnitudefasterthanfeeding a single CHC with all the instances at once. In order to verify the performance of MR-EFS with respect to the sequential approach, a set of experiments were performed using subsets of the epsilon dataset. Both a sequential CHC algorithm and the parallel MR-EFS (with 1000 instances per split) were applied over those subsets. The obtained execution times are presented in Table 3 and Figure 5, along with the runtime of MR-EFS over the whole dataset. The sequential runtimes described a quadratic shape, in concordance with the complexity order of CHC, which clearly states that the time necessary to tackle the whole dataset would be impractical. In opposite, the runtime of MR-EFS for the small datasets was nearly constant. The case with 1000 instances is particular, in the sense that MR-EFS only executed one map task; therefore, it executed a single CHC with 1000 instances. The time difference between CHC andmr-efsinthiscasereflectstheoverheadintroducedby the latter. Even though his overhead increased slightly as the number of map tasks grew, it represented a minor part of the overall runtime. As for the full dataset, with 512 splits, the number of instancesforeachmaptaskinmr-efsisaround780.asthe number of cores used for the experiments was 240, the map phaseinmr-efsshouldberoughlythreetimesslowerthan the sequential CHC with 1000 instances, according to the complexity orders previously detailed. The times in Table 3 show a higher time gap, because MR-EFS includes as well the otherphasesofthemapreduceframework(namely,splitting, shuffle, and reduce), which are nonnegligible for a dataset of such size. Nevertheless, the overall MR-EFS execution time is Sequential CHC MR-EFS MR-EFS (full dataset) Number of instances Figure 5: Execution times of the sequential CHC and MR-EFS. AUC Features Classifier Logistic Regression Naive Bayes SVM-0.0 SVM-0.5 Set Training Test Figure 6: Accuracy with the epsilon dataset. Note that the number of features decreases as the threshold increases. highly scalable, as shown by the fact that MR-EFS was able to process the instances faster than the time needed by CHC to process 5000 instances Classification Results. This section presents the results obtained by applying several classifiers over the full epsilon dataset with its features previously selected by using MR-EFS. Thedatasetwassplitamong512maptasks,eachofwhich computed around 780 instances. The AUC values are shown in Table 4 and Figure 6,both for training and test sets, using three different thresholds for the dataset reduction step. Note that the zero-threshold corresponds to the original dataset, without performing any

8 8 Mathematical Problems in Engineering Table 4: AUC results for the Spark classifiers using epsilon. Threshold Features Logistic Regression Naive Bayes SVM (λ =0.0) SVM (λ =0.5) Training Test Training Test Training Test Training Test Table 5: Training runtime (in seconds) for the Spark classifiers using epsilon. Threshold Features Logistic Regression Naive Bayes SVM (λ =0.0) SVM (λ =0.5) Table 6: Size of the epsilon dataset for each threshold. Threshold Set MB HDFS blocks 0.00 Training Test Training Test Training Test Training Test Run time (s) featureselection.thebestresultsforeachmethodarestressed in boldface. The table shows that the accuracy was improved by the removal of the adequate features, as the threshold 0.55 allowed for obtaining higher AUC values. The accuracy gain was especially large for SVM. Moreover, more than half of the features were removed, which reduces significantly the size ofthedatasetandthereforethecomplexityoftheresulting classifier. A threshold of value 0.60 also got to improve the accuracy results, while reducing even further the size of the dataset. Finally, for the 0.65 thresholds, only SVM saw its AUC improved. It is also noteworthy that the two variants of SVM obtained the same results for all the tested thresholds. This fact indicates that the complexity of the obtained SVM model is relatively low. The training runtime of the different algorithms and databases is shown in Table 5 and Figure 7. Theobtained results were seemingly the opposite to the expectations: except for Naive Bayes, the classifiers needed more time to process the datasets as their number of features decreases. However, this behavior can be explained: as the dataset gets smaller, it occupies less HDFS blocks, and therefore the full parallel capacity of the cluster is not exploited. The size of each version of the epsilon dataset and the number of HDFS blocks that are needed to store it are shown in Table 6. The computer cluster is composed of 20 machines; therefore, when the number of blocks is lower than 20, some Features Classifier Logistic Regression Naive Bayes SVM-0.0 SVM-0.5 Figure 7: Training runtime with the epsilon dataset. Note that the number of features decreases as the threshold increases. ofthemachinesremainidlebecausetheyhavenohdfs blocktoprocess.moreover,evenifthenumberofblocksis slightly above 20, the affected HDFS blocks may not be evenly distributed among the computing nodes. This demonstrates the capacity of Spark to deal with big databases: as the size of the database (more concretely, the number of HDFS blocks) increases, the framework is able to distribute the processes more evenly, exploiting data locality, increasing the parallelism, and reducing the network overhead. In order to deal with this problem, the same experiments were repeated over the epsilon dataset, after reorganizing the files with a smaller block size. For each of the eight sets, the block size S b was calculated according to (6), wheres is the size of the dataset in bytes and n c isthenumberofcoresin the cluster: S b = 2 K, s K= log 2. n c (6)

9 Mathematical Problems in Engineering 9 Table 7: Training runtime (in seconds) for the Spark classifiers using epsilon with customized block size. Threshold Features Logistic Regression Naive Bayes SVM (λ =0.0) SVM (λ =0.5) Table 8: AUC values for the Spark classifiers using ECBDL14-ROS. Threshold Features LogisticRegression Naive Bayes SVM (λ =0.0) SVM (λ =0.5) Training Test Training Test Training Test Training Test Table 9: Training runtime (in seconds) for the Spark classifiers using ECBDL14-ROS. Threshold Features Logistic Regression Naive Bayes SVM (λ =0.0) SVM (λ =0.5) The runtime for the dataset with the block size customized for each subset is displayed in Table 7 and Figure 8. It is observed that the runtime was smaller than that with the default block size. Furthermore, the curves show the expected behavior: as the number of features of the dataset was reduced, the runtime decreased. In the extreme case (for threshold 0.65), the runtime increased again, because with such a small dataset the synchronization times of Spark become bigger than the computing times, even with the customized block size. In the next section, MR-EFS is tested over a very large dataset, validating these observations Experiments with the ECBDL14-ROS Dataset. This section presents the classification accuracy and runtime results obtained with the ECBDL14-ROS dataset. As described in Section 4.1, arandomoversamplingtechnique[39] was previously applied over the original ECBDL14 dataset to overcome the problems originated by its imbalance. The MR- EFS method was applied using map tasks; therefore, each map task computed around 1984 instances. The obtained results in terms of accuracy are depicted in Table 8 and Figure 9. MR-EFS improved the results in all cases, with different thresholds. The accuracy gain was especially important for the Logistic Regression and SVM algorithms. As expected, the SVM with λ = 0.0 obtained better results than the one with λ=0.5, as the latter attempted to reduce the complexity of the obtained model. However, it is noteworthy that, for 119 features, SVM-0.5 was able to outperform SVM-0.0. This hints that after removing noisy features, the simpler obtained models represented better the Run time (s) Features Classifier Logistic Regression Naive Bayes SVM-0.0 SVM-0.5 Figure 8: Training runtime with the epsilon dataset with customized block size. Note that the number of features decreases as the threshold increases. true knowledge underlying the data. With even less features, both variants of SVM obtained roughly the same accuracy. To conclude this study, the runtime necessary to train the classifiers with all variants of ECBDL14-ROS is presented in Table 9 and Figure 10. Inthiscase,theruntimebehaved as expected: the time was roughly linear with respect to the number of features, for all tested models. This means that MR- EFS was able to improve both the runtime and the accuracy for all those classifiers.

10 10 Mathematical Problems in Engineering AUC Classifier Logistic Regression Naive Bayes SVM-0.0 SVM Features Set Training Test Figure 9: Accuracy with the ECBDL14-ROS dataset. Note that the number of features decreases as the threshold increases. Run time (s) Classifier Logistic Regression Naive Bayes Features SVM-0.0 SVM-0.5 Figure 10: Training runtime with the ECBDL14-ROS dataset. Note that the number of features decreases as the threshold increases. 5. Concluding Remarks This paper presents MR-EFS, an Evolutionary Feature SelectionalgorithmdesignedupontheMapReduceparadigm, intended to preprocess big datasets so that they become affordable for other machine learning techniques, such as classification techniques, that are currently not scalable enough to deal with such datasets. The algorithm has been implemented using Apache Hadoop, and it has been applied over two different large datasets. The resulting reduced datasets have been tested using three different classifiers, implementedinapachespark,overaclusterof20computers. The theoretical evaluation of the model highlights the full scalability of MR-EFS with respect to the number of features in the dataset, in comparison with a sequential approach. This behavior has been further confirmed after the empirical procedures. According to the obtained classification results, it can be claimedthatmr-efsisabletoreduceadequatelythenumber of features of large datasets, leading to reduced versions of them, that are at the same time smaller to store, faster to compute, and easier to classify. These facts have been observed with the two different datasets and for all tested classifiers. For the epsilon dataset, the relation between the reduced datasets size and the number of nodes is forced to modify the HDFS block size, proving that the hardware resources canbeoptimallyusedbyhadoopandspark,withthecorrect design. One of the obtained reduced ECDBL14-ROS datasets, with more than 67 million instances and several hundred features, could be processed by the classifiers in less than half of the time than that of the original dataset, and with an improvement of around 5% in terms of AUC. Conflict of Interests The authors declare that there is no conflict of interests regarding the publication of this paper. Acknowledgments This work is supported by the Research Projects TIN P, P10-TIC-6858, P11-TIC-7765, P12-TIC-2958, and TIN P. D. Peralta and S. Ramírez-Gallego hold two FPU scholarships from the Spanish Ministry of Education and Science (FPU12/04902, FPU13/00047). I. Triguero holds a BOF postdoctoral fellowship from the Ghent University. References [1] E. Alpaydin, Introduction to Machine Learning, MIT Press, Cambridge, Mass, USA, 2nd edition, [2] M.Minelli,M.Chambers,andA.Dhiraj,Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today s Businesses (Wiley CIO), Wiley, 1st edition, [3] V. Marx, The big challenges of big data, Nature, vol. 498, no. 7453, pp , [4] A. Fernández, S. del Río,V.López et al., Big data with cloud computing: an insight on the computing environment, MapReduce, and programming frameworks, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol.4,no.5, pp ,2014. [5] E.Merelli,M.Pettini,andM.Rasetti, Topologydrivenmodeling: the IS metaphor, Natural Computing, [6] S. Sakr, A. Liu, D. M. Batista, and M. Alomari, A survey of large scale data management approaches in cloud environments, IEEE Communications Surveys and Tutorials, vol.13,no.3,pp , [7] J. Bacardit and X. Llorà, Large-scale data mining using genetics-based machine learning, Wiley Interdisciplinary

11 Mathematical Problems in Engineering 11 Reviews: Data Mining and Knowledge Discovery, vol.3,no.1, pp.37 61,2013. [8] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, vol.51,no. 1, pp , [9] J. Dean and S. Ghemawat, Map reduce: a flexible data processing tool, Communications of the ACM, vol. 53, no. 1, pp , [10] S. Ghemawat, H. Gobioff, and S.-T. Leung, The google file system, in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 03), pp , October [11] M. Snir and S. Otto, MPI The Complete Reference: The MPI Core, MIT Press, Boston, Mass, USA, [12] W. Zhao, H. Ma, and Q. He, Parallel k-means clustering based on MapReduce, in Cloud Computing, M. Jaatun, G. Zhao, and C. Rong, Eds., vol of Lecture Notes in Computer Science, pp , Springer, Berlin, Germany, [13] A. Srinivasan, T. A. Faruquie, and S. Joshi, Data and task parallelism in ILP using MapReduce, Machine Learning, vol. 86,no.1,pp ,2012. [14] M. Zaharia, M. Chowdhury, T. Das et al., Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp.1 14, USENIX Association, [15]K.Thomas,C.Grier,J.Ma,V.Paxson,andD.Song, Design and evaluation of a real-time URL spam filtering service, in Proceedings of the IEEE Symposium on Security and Privacy (SP 11),pp ,Berkeley,Calif,USA,May2011. [16] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica, Shark: SQL and rich analytics at scale, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 13), pp , ACM, June [17] M. S. Wiewiorka, A. Messina, A. Pacholewska, S. Maffioletti, P. Gawrysiak, and M. J. Okoniewski, SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision, Bioinformatics, vol.30,no.18,pp , [18] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining, Springer, 1st edition, [19] H. Brighton and C. Mellish, Advances in instance selection for instance-based learning algorithms, Data Mining and Knowledge Discovery,vol.6,no.2,pp ,2002. [20] J. A. Olvera-López, J. A. Carrasco-Ochoa, J. F. Martínez- Trinidad, and J. Kittler, A review of instance selection methods, Artificial Intelligence Review,vol.34,no.2,pp ,2010. [21] J. S. Sánchez, High training set size reduction by space partitioning and prototype abstraction, Pattern Recognition, vol.37,no.7,pp ,2004. [22] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, Machine Learning Research,vol.3, pp , [23] H. Liu and L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering, vol.17,no.4,pp , [24] Y. Saeys, I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics,vol.23,no.19,pp , [25] H. Liu and H. Motoda, Computational Methods of Feature Selection, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, Chapman & Hall/CRC Press, [26] J. A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, Springer, [27] B. de la Iglesia, Evolutionary computation for feature selection in classification problems, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery,vol.3,no.6,pp , [28] L. J. Eshelman, The CHC adaptative search algorithm: how to have safe search when engaging in nontraditional genetic recombination, in Foundations of Genetic Algorithms, G. J. E. Rawlins, Ed., pp , Morgan Kaufmann, San Mateo, Calif, USA, [29] MLlib Website, https://spark.apache.org/mllib/. [30] R. Kohavi and G. H. John, Wrappers for feature subset selection, Artificial Intelligence, vol.97,no.1-2,pp , [31] M. Tan, I. W. Tsang, and L. Wang, Towards ultrahigh dimensional feature selection for big data, Machine Learning Research,vol.15,pp ,2014. [32] T. White, Hadoop: The Definitive Guide, O ReillyMedia, 3rd edition, [33] T. Cover and P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, vol.13,no.1,pp.21 27, [34] I. Triguero, D. Peralta, J. Bacardit, S. García, and F. Herrera, MRPR: a MapReduce solution for prototype reduction in big data classification, Neurocomputing,vol.150,pp ,2015. [35] C.-T. Chu, S. Kim, Y.-A. Lin et al., Map-reduce for machine learning on multicore, in Advances in Neural Information Processing Systems, pp , [36] Pascal large scale learning challenge, [37] Epsilon in the LIBSVM website, cjlin/libsvmtools/datasets/binary.html#epsilon. [38] ECBDL14 dataset: Protein structure prediction and contact map for the ECBDL2014 Big Data Competition, [39] S. del Río,V.López,J.M.Benítez, and F. Herrera, On the use of MapReduce for imbalanced big data using Random Forest, Information Sciences,vol.285,pp ,2014. [40] M. A. Hearst, S. T. Dumais, E. Osman, J. Platt, and B. Scholkopf, Support vector machines, IEEE Intelligent Systems and Their Applications, vol. 13, no. 4, pp , [41] D. W. Hosmer Jr. and S. Lemeshow, Applied Logistic Regression, John Wiley & Sons, [42]R.O.DudaandP.E.Hart,Pattern Classification and Scene Analysis,vol.3,Wiley,NewYork,NY,USA,1973. [43] MLlib guide,

12 Advances in Operations Research Advances in Decision Sciences Applied Mathematics Algebra Probability and Statistics The Scientific World Journal International Differential Equations Submit your manuscripts at International Advances in Combinatorics Mathematical Physics Complex Analysis International Mathematics and Mathematical Sciences Mathematical Problems in Engineering Mathematics Discrete Mathematics Discrete Dynamics in Nature and Society Function Spaces Abstract and Applied Analysis International Stochastic Analysis Optimization

### Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

### Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

### Keywords: Big Data, HDFS, Map Reduce, Hadoop

Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

### The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

### Efficient Data Replication Scheme based on Hadoop Distributed File System

, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

### Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

### Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

### Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

### CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

### A tour on big data classification: Learning algorithms, Feature selection, and Imbalanced Classes

A tour on big data classification: Learning algorithms, Feature selection, and Imbalanced Classes Francisco Herrera Research Group on Soft Computing and Information Intelligent Systems (SCI 2 S) Dept.

### Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

### Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

### FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

### A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

### What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

### Distributed Framework for Data Mining As a Service on Private Cloud

RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

### Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

### BIG DATA What it is and how to use?

BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### Fault Tolerance in Hadoop for Work Migration

1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

### Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

### Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

### Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

### The Evolvement of Big Data Systems

The Evolvement of Big Data Systems From the Perspective of an Information Security Application 2015 by Gang Chen, Sai Wu, Yuan Wang presented by Slavik Derevyanko Outline Authors and Netease Introduction

Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

### R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

### Map/Reduce Affinity Propagation Clustering Algorithm

Map/Reduce Affinity Propagation Clustering Algorithm Wei-Chih Hung, Chun-Yen Chu, and Yi-Leh Wu Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology,

With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

### ClusterOSS: a new undersampling method for imbalanced learning

1 ClusterOSS: a new undersampling method for imbalanced learning Victor H Barella, Eduardo P Costa, and André C P L F Carvalho, Abstract A dataset is said to be imbalanced when its classes are disproportionately

### Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

### Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

### A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

### Facilitating Consistency Check between Specification and Implementation with MapReduce Framework

Facilitating Consistency Check between Specification and Implementation with MapReduce Framework Shigeru KUSAKABE, Yoichi OMORI, and Keijiro ARAKI Grad. School of Information Science and Electrical Engineering,

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### From GWS to MapReduce: Google s Cloud Technology in the Early Days

Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

### COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)

### Advances in Natural and Applied Sciences

AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and

### Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

### http://www.wordle.net/

Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

### MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

### RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients

### A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

### A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace

### GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

### Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

### Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

### CSE-E5430 Scalable Cloud Computing Lecture 11

CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11-2015 1/24 Distributed Coordination Systems Consensus

### CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

### Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

### Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

### Research on Job Scheduling Algorithm in Hadoop

Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of

### A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

### A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

### Research Article Hadoop-Based Distributed Sensor Node Management System

Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

### Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

### Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

### processed parallely over the cluster nodes. Mapreduce thus provides a distributed approach to solve complex and lengthy problems

Big Data Clustering Using Genetic Algorithm On Hadoop Mapreduce Nivranshu Hans, Sana Mahajan, SN Omkar Abstract: Cluster analysis is used to classify similar objects under same group. It is one of the

MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

### Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

### Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

### Random forest algorithm in big data environment

Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

### Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity

Noname manuscript No. (will be inserted by the editor) Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Aysan Rasooli Douglas G. Down Received: date / Accepted: date Abstract Hadoop

### MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

### Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

### Challenges for Data Driven Systems

Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

### Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

### Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

### A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

### Processing Large Amounts of Images on Hadoop with OpenCV

Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru

### Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

Adaptive Task Scheduling for MultiJob MapReduce Environments Jordà Polo, David de Nadal, David Carrera, Yolanda Becerra, Vicenç Beltran, Jordi Torres and Eduard Ayguadé Barcelona Supercomputing Center

### Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

### Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

### Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

### Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

### Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

### DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

### Big Data Analytics Hadoop and Spark

Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software

### Handling Big(ger) Logs: Connecting ProM 6 to Apache Hadoop

Handling Big(ger) Logs: Connecting ProM 6 to Apache Hadoop Sergio Hernández 1, S.J. van Zelst 2, Joaquín Ezpeleta 1, and Wil M.P. van der Aalst 2 1 Department of Computer Science and Systems Engineering

### MapReduce Job Processing

April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

### GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

### Concept and Project Objectives

3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the