MAPREDUCE FRAMEWORK FOR ANOMALY DETECTION IN MANUFACTURING DATA

MISS SIKANA TANUPABRUNGSUN

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF MASTER OF ENGINEERING (COMPUTER ENGINEERING)
FACULTY OF ENGINEERING
KING MONGKUT'S UNIVERSITY OF TECHNOLOGY THONBURI
2013
MapReduce Framework for Anomaly Detection in Manufacturing Data

Miss Sikana Tanupabrungsun
B.Eng. (Computer Engineering)

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering (Computer Engineering)
Faculty of Engineering
King Mongkut's University of Technology Thonburi
2013

Thesis Committee
(Asst. Prof. Marong Phadoongsidhi, Ph.D.)   Chairman of Thesis Committee
(Assoc. Prof. Tiranee Achalakul, Ph.D.)     Thesis Advisor
(Santitham Prom-on, Ph.D.)                  Committee
(Watcharapan Suwansantisuk, Ph.D.)          Committee
(Li Xiaorong, Ph.D.)                        Committee

Copyright reserved
Thesis Title       MapReduce Framework for Anomaly Detection in Manufacturing Data
Thesis Credits     12 Credits
Candidate          Miss Sikana Tanupabrungsun
Thesis Advisor     Assoc. Prof. Dr. Tiranee Achalakul
Program            Master of Engineering
Field of Study     Computer Engineering
Department         Computer Engineering
Faculty            Engineering
Academic Year      2013

Abstract

Manufacturing data is an important source of knowledge that can be used to enhance production capability. Detecting the causes of defects can lead to improvements in production. However, production records generally contain an enormous set of features, and the number of alert messages is redundant. Thus, it is almost impossible in practice to monitor all features at once. This research proposes a feature reduction framework designed to identify a subset of informative features together with their correlation groups. By monitoring fewer features, the number of alert messages can be decreased. In our methodology, manufacturing data are pre-processed and adopted as inputs. Subsequently, feature selection is performed by wrapping a Genetic Algorithm (GA) around the k-nearest Neighbor (knn) classifier. To improve performance, the proposed technique was parallelized with MapReduce. The results showed that the number of features can be reduced by 49.02% while maintaining 83.95% accuracy. In addition, with MapReduce on the cloud, the performance was substantially improved. The framework results were validated both by statistical methods and by expert analysts from the manufacturing industry.

Keywords: Data analysis / Feature selection / Genetic Algorithm (GA) / k-nearest Neighbor (knn) / Manufacturing data
Thesis Title       Selecting an Optimal Parameter Subset for a Production-Line Anomaly Detection System on MapReduce
Thesis Credits     12 Credits
Candidate          Miss Sikana Tanupabrungsun
Thesis Advisor     Assoc. Prof. Dr. Tiranee Achalakul
Program            Master of Engineering
Field of Study     Computer Engineering
Department         Computer Engineering
Faculty            Engineering
Academic Year      2013 (B.E. 2556)

Abstract

Production-line data is an important source of knowledge for businesses seeking to improve the efficiency of their production lines. Detecting the causes of anomalies is therefore a key part of developing and maintaining product quality. However, production-line data usually consist of a large number of variables, and anomaly detection systems tend to generate an excessive number of alert messages, which makes analyzing and detecting anomalies a difficult and complicated process. This research presents a framework for selecting the group of features that are informative and affect production. The analysis begins by pre-processing the production-line data into a suitable format; feature selection is then performed by applying a Genetic Algorithm together with k-nearest Neighbor (knn), a data classification algorithm. The Genetic Algorithm selects and proposes candidate feature groups, which are then evaluated with knn. Because feature selection must be performed continuously, the algorithm was executed on MapReduce to increase efficiency and reduce processing time. Experimental results show that the framework can reduce the number of features by 49.02% while maintaining 83.95% accuracy, and that execution on MapReduce made the framework substantially faster. The framework's results were validated both by statistical methods and by the opinions of expert analysts at the factory.

Keywords: Data analysis / Parameter selection / Genetic Algorithm / k-nearest Neighbor (knn) / Production-line data
ACKNOWLEDGEMENTS

I would like to express my appreciation and thanks to my advisor, Assoc. Prof. Tiranee Achalakul. Without her supervision and constant help, this research would not have been possible. I would also like to thank all the CAST Lab members for a great time during my study. I would also like to extend my appreciation to Ms. Vimonrat Wongsathit and her colleagues at Betagro Limited for their help in validating the results and for their advice. Last but not least, a special thanks to my family for their support and encouragement throughout my study.
CONTENTS

ENGLISH ABSTRACT
THAI ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF EQUATIONS

CHAPTER
1. INTRODUCTION
2. LITERATURE REVIEW AND RELATED THEORIES
   2.1 Literature Review
   2.2 Related Theories
3. PROPOSED WORK
   3.1 Framework Design
   3.2 Parallel Algorithm
   3.3 Evaluation Methods
4. DESIGN AND ANALYSIS OF THE EXPERIMENTS
   4.1 Framework Results
   4.2 Framework Accuracy
   4.3 Framework Scalability
5. RESULT DISCUSSION
   5.1 Quality of Solution
   5.2 Subset Validation
   5.3 Adoption Plan
6. CONCLUSION
REFERENCES
CONTENTS (Cont.)

APPENDIX
A. Complete Information of Manufacturing Parameters
B. Scalability Result
C. PCA Result from the Correlation Matrix
D. Validation Result from Expert Analysts (EA)
CURRICULUM VITAE
LIST OF TABLES

2.1 Feature Selection Categories
2.2 Genetic Algorithm and ABC
2.3 Comparison of 4 Classifiers
Genetic Terms in Context of Computation
Correlation Groups
Solution Grouped by the Correlation Groups
Warning Information
Parameters Setting
Accuracy from Full Features and Reduced Features
T-test Results
Datasets Characteristic
Accuracy from … and MapReduce GA/kNN
General-purpose Machine Profiles
List of Clusters
Speedup and Efficiency
LIST OF FIGURES

1.1 Manufacturing Process
Genetic Algorithm Flowchart
Binary Encoding
Permutation Encoding
Value Encoding
N-Point Crossover
Uniform Crossover
Mutation
Nearest Neighbor
Time Series Plot
Modified Workflow
Alert Record
Formatted Batch Alert Record
Example of Encoded Chromosome
Pseudo-code of knn Classifier
knn Implementation on Hadoop
Plot of Execution Time and Instances
LIST OF EQUATIONS

3.1 Chromosome Encoding
Classification Accuracy
Fitness Function
Reduction Rate
Number of Instances
CHAPTER 1
INTRODUCTION

Knowledge and solutions to problems are often locked up in the flood of data. However, real-world data are complex and rarely systematic, and are usually suited only to direct, unsophisticated observation. Data analytics is the science of analyzing raw data, aiming to draw appropriate conclusions and explanations from the given data. Data mining is the process of discovering knowledge by analyzing data with different techniques and from different perspectives, aiming to explore patterns and trends in the data. It is a method of turning data into knowledge and has contributed to various businesses for decades. Different techniques have been applied in industry since the 1990s and have progressed in many different areas. Data mining has also become a topic of great interest to modern researchers.

Knowledge is the most valuable asset for any business, as it provides an opportunity for a business to differentiate itself from competitors. Manufacturing data is widely recognized as a source of knowledge for detecting the causes of certain defects, which can lead to potential solutions for product quality control. Food production is one of the manufacturing processes in which quality control should be considered intensively. Collaborating with Betagro Limited, one of Thailand's largest food producers in agro-industry, which engages in fresh and frozen chicken and hygienic pork processing, we learnt that a great amount of data is generated during the production process and stored in a huge data warehouse for the quality control process. In many cases, the production records contain a set of parameters that is too large to be monitored simultaneously. In addition, we found that the business model of Betagro is make-to-order, meaning that the manufacturing schedule is not routine but varies according to the demands and sale orders received in each period. For instance, the same production line may produce different products each day.
Consequently, monitoring of the production line should be done continuously, and real-time anomaly detection is desirable. It is expected that a real-time process will promote prompt responses, which will result in effective error prevention. For instance, once a potentially defective pattern is detected, the system will generate an alert and prompt the engineer to fix it immediately. Specifically, there are three stations in the production line: Preparation, Cooking and IQF, as illustrated in Figure 1.1. The process starts with preparing the raw materials (i.e. raw meat) in Preparation, where the material is measured and marinated with sauces. Then, in the Cooking process, the material is cooked and cut into shape per order. Last is Individual Quick Freezing (IQF), which includes the packing and metal detection processes. Throughout the manufacturing processes, the quality control unit continuously records the environmental and production parameters to facilitate the monitoring system, namely the anomaly detection system.
Figure 1.1 Manufacturing Process

In this work, the records from the production line were adopted as experimental data. The dataset consists of 51 parameters (for example: after-tumbling weight, raw material temperature and oil temperature), as presented in Appendix A. Statistical process control is a technique for examining the variation of production parameters. The current anomaly detection system was developed with a predefined set of parameters to be monitored, and an individual parameter raises an alert if its value is irregular. However, in many instances the alerts turned out to be redundant, resulting in ineffective monitoring. For instance, there were over 10,000 alerts in one month. This makes it impossible for the engineers to review and fix every alert, which means there is a greater chance of important alerts going unnoticed. Thus, we propose a feature selection framework to automatically obtain the subset of informative parameters and identify representatives of the removed parameters. The framework is expected to be a general-purpose tool, but it is also customized as an add-on module for the current anomaly detection system. It learns from past records of parameter values and recognizes patterns of anomaly. This information is used to categorize parameters and to identify relationships between them. With this methodology, the number of parameters to be monitored in the future detection process can be minimized, and the number of alerts can be reduced significantly. As a result, the efficiency of the alert process can be optimized. The learning process will be scheduled as appropriate, i.e. monthly or weekly. The selection process is designed as a pre-process; in reality, however, the production line changes daily and is non-deterministic. The updated set of representative parameters is expected to be ready before production starts. Thus, the processing time should be minimized.
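The control-limit monitoring described above can be sketched in a few lines. This is a hedged illustration only: the parameter name and readings are hypothetical, and the three-sigma limits below are one common statistical process control convention, not necessarily the rule used in the plant.

```python
from statistics import mean, stdev

def control_limits(history, k=3.0):
    """Return (lower, upper) control limits as mean +/- k standard deviations."""
    m, s = mean(history), stdev(history)
    return m - k * s, m + k * s

def check_alerts(readings, limits):
    """Return the names of parameters whose current reading is outside its limits."""
    return [name for name, value in readings.items()
            if not (limits[name][0] <= value <= limits[name][1])]

# Hypothetical example: limits learned from past records of one parameter.
history = [72.1, 71.8, 72.4, 72.0, 71.9, 72.2, 72.3, 71.7]
limits = {"oil_temperature": control_limits(history)}
alerts = check_alerts({"oil_temperature": 80.0}, limits)   # out of bounds -> alert
```

With many parameters, each out-of-bounds reading produces one alert, which is exactly how the alert volume grows with the number of monitored parameters.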
To allow the feature selection to complete in time for the start of the daily make-to-order manufacturing, fast computation is necessary to handle the massively high-dimensional data. It is therefore beneficial to parallelize the designed technique. However, as such
a large amount of data is generated in a single day, the size of the data is also of concern. We adopt MapReduce to accelerate the methodology and handle the large data size. MapReduce is a linearly scalable programming paradigm developed by Google in 2004, based on key-value pairs. It was designed to support data-intensive tasks that require processing of the whole dataset in a batch fashion. The framework automatically parallelizes and executes the program on a cluster, and the runtime system handles the details of data partitioning, scheduling and failures. The emergence of this framework brings a great opportunity to empower existing algorithms with the ability to handle large populations. Many researchers have responded to the need to process huge volumes of data by scaling algorithms with the help of MapReduce; one of the most commonly used implementations is Hadoop. Hadoop is a parallel and distributed programming framework implemented on the MapReduce paradigm. It consists of two parts: the Hadoop Distributed File System (HDFS) and a MapReduce engine. HDFS was introduced because big-data analysis raises not only the issue of computation time but also of how fast the data can be accessed; it underlies one remarkable characteristic of Hadoop, namely how the compute node accesses the data to be processed. It also achieves reliability of the stored data by replicating files across multiple nodes. For the MapReduce engine, the framework automatically parallelizes and executes the application on a cluster of machines while abstracting away the complexities of big-data analysis, for instance data partitioning, data distribution, fault tolerance and load balancing. The engine was designed with a shared-nothing architecture; thus, the framework performs especially well for applications composed of independent tasks. Another remarkable benefit of Hadoop is the concept of data locality.
The scheduler of the framework automatically co-locates computation with the data. In other words, it prioritizes the compute node that is closest to the data node, which can significantly reduce the communication overhead and the traffic from data transfer between nodes. Closeness is measured by the physical location of the node: the same node is the closest, and a node in the same rack is considered next. Another considerable benefit of Hadoop is that the framework was designed to run on commodity hardware; thus, it can be integrated into the current system by leveraging the existing hardware. In summary, we have designed a feature selection framework to minimize the number of parameters to be monitored and, as a consequence, reduce the number of alerts. The subset of parameters will be formulated and the representatives will be identified. Hadoop will be integrated in two primary ways. The first is computation: the algorithm will be parallelized across the compute nodes to minimize computing time. The second is storage: the production data will be stored in HDFS to accelerate data access, since the process learns from historical records. This study is expected to contribute to the big-data analysis community through the design of a parallel algorithm on the Hadoop platform. It also contributes to the food manufacturing industry, as we focused on practicality by customizing the framework to suit the food manufacturing process.
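To make the key-value model concrete, here is a minimal in-memory simulation of one MapReduce round in plain Python (not Hadoop itself). The record format, in which each line carries a hypothetical chromosome id together with the accuracy and instance count measured on one data partition, is an illustrative assumption, not the thesis's actual implementation.

```python
from collections import defaultdict

def fitness_mapper(line):
    """Map step: parse 'chromosome_id<TAB>accuracy<TAB>instances' and emit a
    key-value pair keyed by chromosome id."""
    cid, acc, n = line.strip().split("\t")
    return cid, (float(acc) * int(n), int(n))

def fitness_reducer(key, values):
    """Reduce step: merge per-partition results into one weighted accuracy."""
    weighted, total = (sum(col) for col in zip(*values))
    return key, weighted / total

def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce over a list of records."""
    groups = defaultdict(list)
    for line in lines:
        key, value = fitness_mapper(line)
        groups[key].append(value)
    return dict(fitness_reducer(k, v) for k, v in groups.items())

records = ["c17\t0.80\t100", "c17\t0.90\t100", "c02\t0.75\t200"]
accuracies = run_job(records)
```

On Hadoop, the shuffle step of `run_job` is performed by the framework itself, and each mapper would run on the node holding its partition of the HDFS data, which is the data-locality benefit described above.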
For better understanding: in this work, the term feature refers to a parameter of the manufacturing process. Thus, an abnormal feature refers to a parameter whose value is out of the given boundary, while a normal feature refers to a parameter with the desired value. The thesis is divided into six chapters. The literature survey is presented in Chapter 2, in which we discuss several feature selection techniques and relevant research; background and related theories are also summarized at the end of that chapter. Chapter 3 illustrates the proposed framework, the parallel algorithm and the evaluation methods. Chapter 4 presents the experimental designs and results. Chapter 5 discusses the quality of the results. The last chapter is the conclusion of the entire work.
CHAPTER 2
LITERATURE REVIEW AND RELATED THEORIES

In recent years, different algorithms have been employed to identify the significance of features in order to obtain low-dimensional problems by removing irrelevant attributes. The appropriate technique depends on the characteristics of the problem and the dataset. This chapter summarizes related research adopting different techniques aimed at reducing the dimension of the problem. In addition, some related theories and algorithms are included at the end of this chapter.

2.1 Literature Review

Feature Selection

Feature selection is a method of identifying the set of significant features in order to obtain the optimal subset, i.e. the one that most strongly affects the yield. The simplest method is to try every possibility, which is computationally intractable; it is therefore highly recommended to analyze the data carefully with appropriate techniques. There are three major categories of feature selection: Filter, Embedded and Wrapper techniques, which are summarized in Table 2.1.

Table 2.1 Feature Selection Categories

Technique | Processing time | Feature dependency | Limitation                                            | Example algorithms
Filter    | Fast            | No                 | Ignores possible dependency between features          | Statistical techniques
Embedded  | Fast            | Yes                | Requires modification to the classification algorithm | LDA, AUC and SVM
Wrapper   | Slow            | Yes                | Computationally intensive                             | GA-DT, ABC-kNN

The first is the Filter technique, composed of two individual tasks: selection and evaluation. One of the most widely used Filter techniques in the manufacturing plant is to apply statistical methods. One study applied a factorial-based method to screen the manufacturing parameters. The proposed methodology was a combination of Principal Component Analysis (PCA) and factorial designs. PCA with the correlation matrix was performed to reduce the set of features. Then, they fitted a regression model of the yield with the 2-level encoded parameters. Lastly, they performed the full factorial design on
the selected features to identify the significant interactions. They adopted manufacturing data of 5,000 instances, divided into Pass and Fail instances. The result showed that this method could identify 80% of the parameters previously selected by the experts, plus some additional parameters. They claimed that the proposed methodology could handle a large number of parameters very well. The second is the Embedded technique, which builds the selection process inside a classifier. The third, the Wrapper technique, is the most general and widely used, because its implementation is simple and broadly applicable. This approach wraps the selection process around a classification technique. Once a subset is selected, the searching process is guided by the performance of the classifier. In other words, the subset is scored by the classification accuracy calculated from the candidate subsets, which indicates whether the optimal solution has been reached. Since the classification must be run iteratively, the task is computationally intensive and requires a relatively long execution time. However, with the advancement of high-performance computing power, this issue is manageable. In 2011, K. Chrysostomou et al. proposed a methodology to investigate the relationship between computer experience and users' preferences towards search engines. They proposed a feature selection technique called Wrapper-based Decision Tree (WDT). The Wrapper feature selection approach was adopted to select the set of features accurately. The selection method obtained a mutual set among four classifiers, each having its own specific bias: Bayesian Network (BN), Decision Tree (DT), knn and Support Vector Machines (SVM). Once the subset of features was obtained, they applied the DT algorithm to evaluate the subset of features. The subset that produced the tree with the highest accuracy was then selected.
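The Wrapper loop itself is classifier-agnostic. Below is a hedged sketch of greedy forward selection, one simple Wrapper scheme: `score` stands in for any classifier-accuracy function, and the lambda used in the example is a made-up accuracy table, not real data.

```python
def forward_select(features, score):
    """Wrapper-style greedy forward selection: repeatedly add the feature that
    most improves the classifier score, stopping when no candidate helps."""
    selected, best = [], score([])
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        scored = {f: score(selected + [f]) for f in candidates}
        winner = max(scored, key=scored.get)
        if scored[winner] <= best:    # no candidate improves the subset
            break
        selected.append(winner)
        best = scored[winner]
    return selected, best

# Hypothetical stand-in for classifier accuracy on a feature subset.
score = lambda s: 0.5 + 0.2 * ("a" in s) + 0.1 * ("b" in s) - 0.05 * ("c" in s)
selected, best = forward_select(["a", "b", "c"], score)
```

Each candidate subset costs one full classifier evaluation, which is exactly why the Wrapper category in Table 2.1 is marked computationally intensive.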
The result showed great potential for the three-classifier combination (BN, DT and knn), which produced the most informative set of features. It should be noted, however, that the article adopted only a small set of classifiers for feature selection; experimenting with a wider range of algorithms would be an interesting prospect. Two of the most widely used and effective families of algorithms are Evolutionary Algorithms (EA) and Swarm Intelligence algorithms. The most popular EA is the Genetic Algorithm (GA), and the classical example of Swarm Intelligence is the Artificial Bee Colony (ABC). Both have been widely employed in a great deal of research, as discussed in the following paragraphs. An article on reducing bioinformatics data dimension with ABC-kNN proposed an algorithm to reduce the complexity of computation by obtaining a low-dimensional problem. The authors adopted ABC to reduce the dimensions of the dataset by identifying the irrelevant features. This worked per iteration: once the employed and onlooker bees obtained a new food source, in other words a new subset of features, the knn accuracy was computed as a fitness function to evaluate the subset. The proposed algorithm was applied to a dataset of 140 records with 32 features. The result yielded 85% classification accuracy with a set of eight selected features. S. Geetha et al. proposed an evolving decision tree classifier (EDTC) that integrated GA and DT for optimal quantitative feature selection and pattern classification. They aimed to determine whether a multimedia object in steganography had been modified. The goal was to minimize the number of features in each group, in other words, to obtain a
low-dimensional problem. The dataset of 200 audio samples was classified as clean, benign or adulterated signals; each record had 25 features. The classification accuracy was evaluated by producing a decision tree with a set of features, and the accuracy was computed as a fitness function. The result showed the great potential of GA over an existing method. Another effective combination was proposed by M. Su et al. Their research proposed a feature-weighting algorithm implemented on GA and combined with knn. The features of interest were of a quantitative type. GA created a chromosome as a subset of features; knn then performed the evaluation by computing the accuracy rate, which was taken as the fitness value. Chromosomes with larger fitness values were promoted to be parents. Then, two-point crossover was applied to exchange genes between chromosomes, resulting in children. The mutation constraint was set such that a mutation yielding a 10% larger fitness value would be accepted; otherwise, the mutation would be ignored. The algorithm was tested on a dataset with 35 features; accuracy was reported for the top 19 features ranked by the proposed methodology. Table 2.2 compares GA and ABC along two dimensions, exploration and exploitation. Exploration is the capability to create population diversity by exploring the search space; in contrast, exploitation is the reduction of diversity by exploiting the neighborhood. The schema theorem remarks that the crossover operation is an efficient form of exploration because it preserves the good parts from different parents rather than relying on randomization alone; the combination of the good portions offers better offspring. As a result, ABC tends to converge to a local optimum faster than GA, while GA requires a longer computation time.
Table 2.2 GA and ABC

Algorithm             | Exploration                                                                                          | Exploitation
Genetic Algorithm     | The crossover operator breaks the link between parents, which leads the algorithm into unexplored search space | The mutation operator exploits the neighborhood by adjusting the current solution with a step size
Artificial Bee Colony | The scout bees explore new food sources by randomly selecting new solutions in the unexplored search space     | The onlooker bees exploit the neighborhood by randomly adjusting the current solution

In addition, we aim to minimize the execution time by parallelizing the algorithm. As discussed by R. Arora et al., GA is suitable for parallelization because its tasks have low dependency. Specifically, the fitness evaluation and reproduction processes are independent tasks and have the Markov property. The Markov property defines a memoryless system: when the process moves to the next state, it depends only on the current state. For instance, reproduction can produce a new generation using only the information from the fitness evaluation process.
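As a concrete illustration of the GA-wrapper idea discussed above, here is a compact, serial sketch on toy data. The population size, rates, dataset and leave-one-out evaluation are illustrative choices for this sketch, not the settings or implementation used in this thesis; the point is only that each chromosome is a binary feature mask scored by knn accuracy.

```python
import random
from collections import Counter

def knn_predict(train, labels, x, mask, k=3):
    """Classify x by majority vote among its k nearest training points,
    measuring squared distance only on the features selected by the mask."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b, m in zip(t, x, mask) if m), y)
        for t, y in zip(train, labels))
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def fitness(train, labels, mask):
    """Leave-one-out knn accuracy of the feature subset encoded by the mask."""
    if not any(mask):
        return 0.0
    hits = sum(
        knn_predict(train[:i] + train[i + 1:], labels[:i] + labels[i + 1:],
                    x, mask) == y
        for i, (x, y) in enumerate(zip(train, labels)))
    return hits / len(train)

def ga_select(train, labels, n_features, pop=8, gens=10, seed=1):
    """Evolve binary feature masks; fitter masks survive and reproduce."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop)]
    for _ in range(gens):
        ranked = sorted(population,
                        key=lambda m: fitness(train, labels, m), reverse=True)
        parents = ranked[:pop // 2]             # keep the fitter half
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:              # occasional bit-flip mutation
                i = rng.randrange(n_features)
                child[i] ^= 1
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: fitness(train, labels, m))

# Toy data: feature 0 separates the classes, feature 1 is noise.
train = [(0, 5), (1, 7), (2, 1), (8, 6), (9, 2), (10, 8)]
labels = [0, 0, 0, 1, 1, 1]
best_mask = ga_select(train, labels, n_features=2)
```

The fitness evaluations inside each generation are mutually independent, which is the low-dependency property that makes the algorithm a natural fit for MapReduce-style parallelization.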
Classification Algorithms

As described earlier, the evaluation process of feature selection usually adopts classification techniques to score the subset. Selecting an appropriate technique depends on the purpose of classification; the characteristics of the data should be considered as well. This section discusses the usage of classification techniques with real-world data. Classification is a supervised learning method aimed at prediction by exploring the references reflected by groups of data. Current active research particularly focuses on a few classifiers which have been proved to solve classification problems successfully in many domains. This section discusses four widely used algorithms: Decision Tree (DT), k-nearest Neighbor (knn), Support Vector Machine (SVM) and Naïve Bayes (NB). Decision Tree is an algorithm that visualizes the relationships among parameters to create a predictive model. Many algorithms have been developed to construct a tree, such as the AID, SERCH, CHAID, CART, OC, ID3, C4.5, C5, QUEST and SAS algorithms. One work presented an implementation of fault diagnosis with the C4.5 decision tree algorithm. C4.5 is a classifier with high accuracy and simple operation implemented on the DT algorithm. They visualized the pump state from vibration signals, which were summarized with 9 statistical measures, such as standard error, standard deviation and variance. The statistical measures were considered as features. The classifier was trained with 150 samples and validated with 100 samples. The test dataset yielded 100% accuracy. They pointed out a critical issue for classification: the algorithm should not overfit the data during training, because the accuracy might suffer later. Another work also applied the C4.5 decision tree classifier to detect failure patterns in disk drive manufacturing.
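C4.5 chooses its splits by gain ratio, i.e. information gain normalised by the split's own entropy. A minimal sketch of that criterion follows; the Pass/Fail toy data are hypothetical and the sketch covers only the nominal-attribute case, not full C4.5.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5-style splitting criterion: information gain of splitting on an
    attribute, divided by the entropy of the split itself (this penalises
    attributes with many distinct values)."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in groups.values())
    split_info = entropy(values)
    return gain / split_info if split_info else 0.0

# A perfectly separating attribute versus an uninformative one.
outcome = ["Pass", "Pass", "Fail", "Fail"]
perfect = gain_ratio(["a", "a", "b", "b"], outcome)
useless = gain_ratio(["a", "b", "a", "b"], outcome)
```

At each node, the tree builder picks the attribute with the highest gain ratio and recurses on the resulting groups.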
They adopted a dataset of 10,000 instances from the HGA production line for the experiment. The dataset was categorized into 2 classes: Pass and Fail. The decision tree was trained with 8,000 instances and tested with another 2,000 instances. They used the corrected gain ratio as the node splitting criterion. The classification accuracy was 92.3%. They discussed that their method could handle missing-value attributes very well. knn is an instance-based learning method; it classifies data by measuring similarity with the k nearest representatives. T. L. Priya et al. proposed speech identification classification with the knn algorithm. They aimed to classify voices using data from speech processing tools. The data were classified into 2 groups: normal and disordered voice. With the traditional knn, 80% classification accuracy was achieved. They discussed that knn is not only the simplest algorithm to implement but also an efficient one. Another work proposed an improved knn algorithm based on data density for text classification. They raised a concern about the sensitivity of the algorithm to uneven distributions, which may significantly affect its accuracy. For instance, in a high-density area, the similarity between objects will be much larger
than in a low-density area. As a result, the probability of selecting objects from the high-density area will be higher; thus, a knn classification technique was proposed that takes the data distribution into consideration. The methodology balances the density and similarity factors: for example, when the density is low and the nearest neighbors are much lower, the similarity should be amplified; in contrast, if the density is high and the neighbors are much higher, the similarity should be reduced. They validated the algorithm on a text classification application and compared it with the traditional methodology. They found that the classification accuracy was greatly improved and became more stable. SVM is another well-known classification algorithm. It is generally used to build models for analyzing data and recognizing patterns. The concept of SVM is based on decision planes, which divide the data points into classes while maximizing the margin between groups to achieve the best discrimination. One work applied SVM to predict turbine failures in the production environment of a thermal power plant. A dataset of 10,822 instances was collected from the power plant. Each record consisted of 29 monitoring parameters, all of continuous type. The records were categorized into 3 groups: normal, low and abnormal. They used a sampling method with 70% of the data for training and the remaining 30% for testing. They compared SVM against Back-Propagation Neural Networks (BPN) and Linear Discriminant Analysis (LDA). The results showed that SVM achieved the highest accuracy at 93.13%, while BPN and LDA achieved only 85.57% and 75.95% respectively. They also discussed that SVM could handle a large amount of data very well. Another work compared the performance of SVM and Naïve Bayes on text classification with data classified into two classes: interesting and uninteresting.
The algorithms were evaluated on three datasets (highly imbalanced, moderately imbalanced and balanced). The results showed that SVM outperformed Naïve Bayes on every dataset. They discussed that while the sparseness of these datasets is crucial for Naïve Bayes, it does not affect the performance of SVM. NB is a probabilistic classifier based on Bayes' theorem. It has been shown to work quite well on many real-world datasets, especially high-dimensional data. One work applied Naïve Bayes to classify terrain textures from a moving robot's data. The raw vibration signal data were collected from the deflection of artificial whiskers attached to the robot. The data comprised four classes: two different carpets, tarmac and a vinyl surface. They adopted a sampling technique to divide the dataset into training and testing data with a 0.5 sampling rate. The results showed that the classifier predicted the class with an 80% hit rate, which increased to 90% when more data were adopted. They discussed the essentiality of having enough data to train the model. Table 2.3 shows the advantages and disadvantages of the four classifiers. We have learned that the remarkable advantage of knn over other classifiers in the feature selection context is its sensitivity to irrelevant attributes. This characteristic can contribute significantly to the selection process, because knn is likely to detect when the selection algorithm proposes a candidate with irrelevant attributes. knn is also robust to noise, which is crucial for real-world data because, in reality, data can be neither well distributed nor complete. In addition, knn is claimed to be
effective and appropriate for processing large datasets. However, although the running time and performance of knn are not comparable with other classifiers, this issue is not severe given the availability of computational power and high-performance platforms.

Table 2.3 Comparison of 4 Classifiers

Classifier    | Advantages                                                              | Disadvantages
knn           | Robust to noise; sensitive to irrelevant attributes                     | Memory intensive; slow and poor runtime performance (esp. for large datasets)
Decision Tree | Good runtime performance; handles both nominal and numeric attributes   | Sensitive to noise
SVM           | Robust to sparseness                                                    | Slow and poor runtime performance
Naïve Bayes   | Good runtime performance                                                | Sensitive to dependency between attributes

Big Data Analysis

The perfect combination of algorithms is not always the best answer, as "a large amount of data beats the better algorithm". With regard to the dramatic increase in the amount of data, parallel and distributed processing are popular practices, and it is of general interest to integrate algorithms with high-performance platforms. Cloud computing is an efficient paradigm for both compute- and data-intensive applications. MapReduce is widely adopted to maximize efficiency and parallelize algorithms on cloud infrastructure. Many articles have responded to the need to process huge volumes of data by scaling algorithms with MapReduce and the implementations that follow it; the most commonly used is Hadoop, employed extensively by a number of organizations. The emergence of the framework brings a great opportunity to empower existing algorithms with the ability to handle large populations. An article on efficient parallel knn joins for large data in MapReduce proposed a novel method extended from knn joins to process a large amount of data. A knn join is a combination of joins and nearest-neighborhood search; both are effective but costly operations. Thus, the authors parallelized the algorithm in MapReduce to improve the performance.
The article discussed the complexity of the straightforward method of integrating MapReduce with knn joins, the Block Nested Loop: in the worst case the cost is quadratic in the dataset size, O(n^2); in other words, the number of reducers required is quadratic and the dataset needs to be duplicated into n^2 pieces. Hence, they leveraged a space-filling curve (the z-value) to transform knn joins into a sequence of one-dimensional searches. Three pairs of mappers and reducers were employed during the process.
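The z-value idea can be sketched as bit interleaving. This is a hedged illustration of the space-filling curve only, not the article's three-stage mapper-reducer pipeline, and the points below are hypothetical.

```python
def z_value(coords, bits=8):
    """Interleave the bits of each coordinate (most significant first) to map
    a multi-dimensional point onto the one-dimensional Z-order (Morton) curve."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> bit) & 1)
    return z

# Points that are close in space tend to be close in z-order, so a knn search
# can be approximated by a one-dimensional range scan over sorted z-values.
points = [(3, 4), (3, 5), (200, 7), (2, 4)]
order = sorted(points, key=z_value)
```

Sorting by z-value clusters the nearby points together while pushing the distant point to the end of the order, which is what lets the join proceed as range scans over a one-dimensional key.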