MAPREDUCE FRAMEWORK FOR ANOMALY DETECTION IN MANUFACTURING DATA MISS SIKANA TANUPABRUNGSUN


MAPREDUCE FRAMEWORK FOR ANOMALY DETECTION IN MANUFACTURING DATA

MISS SIKANA TANUPABRUNGSUN

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING (COMPUTER ENGINEERING)
FACULTY OF ENGINEERING
KING MONGKUT'S UNIVERSITY OF TECHNOLOGY THONBURI
2013

MapReduce Framework for Anomaly Detection in Manufacturing Data

Miss Sikana Tanupabrungsun
B.Eng. (Computer Engineering)

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Engineering (Computer Engineering)
Faculty of Engineering
King Mongkut's University of Technology Thonburi
2013

Thesis Committee
Chairman of Thesis Committee: Asst. Prof. Marong Phadoongsidhi, Ph.D.
Thesis Advisor: Assoc. Prof. Tiranee Achalakul, Ph.D.
Committee: Santitham Prom-on, Ph.D.
Committee: Watcharapan Suwansantisuk, Ph.D.
Committee: Li Xiaorong, Ph.D.

Copyright reserved

Thesis Title: MapReduce Framework for Anomaly Detection in Manufacturing Data
Thesis Credits: 12
Candidate: Miss Sikana Tanupabrungsun
Thesis Advisor: Assoc. Prof. Dr. Tiranee Achalakul
Program: Master of Engineering
Field of Study: Computer Engineering
Department: Computer Engineering
Faculty: Engineering
Academic Year: 2013

Abstract

Manufacturing data is an important source of knowledge that can be used to enhance production capability. Detecting the causes of defects can lead to improvements in production. However, production records generally contain an enormous set of features, and many of the alert messages they trigger are redundant. Thus, it is almost impossible in practice to monitor all features at once. This research proposes a feature reduction framework designed to identify a subset of informative features together with their correlation groups. By monitoring fewer features, the number of alert messages can be decreased. In our methodology, manufacturing data are pre-processed and adopted as inputs. Subsequently, feature selection is performed by wrapping a Genetic Algorithm (GA) around the k-Nearest Neighbor (kNN) classifier. To improve the performance, the proposed technique was parallelized with MapReduce. The results showed that the number of features can be reduced by 49.02% while retaining 83.95% accuracy. In addition, with MapReduce on the cloud, the performance was increased by times. The framework's results were validated both by statistical methods and by expert analysts from the manufacturing industry.

Keywords: Data analysis / Feature selection / Genetic Algorithm (GA) / k-Nearest Neighbor (kNN) / Manufacturing data

Thai Abstract

Thesis Title: Selecting the Optimal Parameter Subset for a Production-Line Anomaly Detection System on MapReduce
Credits: 12
Author: Miss Sikana Tanupabrungsun
Advisor: Assoc. Prof. Dr. Tiranee Achalakul
Program: Master of Engineering
Field of Study: Computer Engineering
Department: Computer Engineering
Faculty: Engineering
Academic Year: 2556 B.E. (2013)

Abstract

Production-line data is an important source of knowledge for businesses seeking to improve the efficiency of their production lines. Detecting the causes of anomalies is therefore essential to developing and maintaining product quality. However, production-line data usually consist of a large number of variables, and anomaly detection systems tend to generate an excessive number of alert messages, making analysis and detection a cumbersome and complicated process. This research proposes a framework for selecting the group of features that is relevant and affects production. The analysis begins by pre-processing the production-line data into a suitable format, then performs feature selection by combining a Genetic Algorithm with k-Nearest Neighbor (kNN), a classification algorithm: the Genetic Algorithm selects and proposes candidate feature subsets, which are then evaluated with kNN. Because feature selection must be performed continuously, the algorithm was executed on MapReduce to increase efficiency and reduce processing time. The experimental results show that the framework can reduce the number of features by 49.02% while maintaining an accuracy of 83.95%, and with MapReduce the framework was sped up by times. The framework's results were validated both by statistical methods and by the opinions of experts at the plant.

Keywords: Data analysis / Parameter selection / Genetic Algorithm / k-Nearest Neighbor (kNN) / Production-line data

ACKNOWLEDGEMENTS

I would like to express my appreciation and thanks to my advisor, Assoc. Prof. Tiranee Achalakul. Without her supervision and constant help, this research would not have been possible. I would also like to thank all the CAST Lab members for a great time during my study. I would like to extend my appreciation to Ms. Vimonrat Wongsathit and her colleagues at Betagro Limited for their help in validating the results and for their advice. Last but not least, a special thanks to my family for their support and encouragement throughout my study.

CONTENTS

ENGLISH ABSTRACT
THAI ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF EQUATIONS

CHAPTER
1. INTRODUCTION
2. LITERATURE REVIEW AND RELATED THEORIES
   2.1 Literature Review
   2.2 Related Theories
3. PROPOSED WORK
   3.1 Framework Design
   3.2 Parallel Algorithm
   3.3 Evaluation Methods
4. DESIGN AND ANALYSIS OF THE EXPERIMENTS
   4.1 Framework Results
   4.2 Framework Accuracy
   4.3 Framework Scalability
5. RESULT DISCUSSION
   5.1 Quality of Solution
   5.2 Subset Validation
   5.3 Adoption Plan
6. CONCLUSION
REFERENCES

CONTENTS (Cont.)

APPENDIX
A. Complete Information of Manufacturing Parameters
B. Scalability Result
C. PCA Result From the Correlation Matrix
D. Validation Result From Expert Analysts (EA)

CURRICULUM VITAE

LIST OF TABLES

2.1 Feature Selection Categories
2.2 Genetic Algorithm and ABC
2.3 Comparison of 4 Classifiers
Genetic Terms in Context of Computation
Correlation Groups
Solution Grouped by the Correlation Groups
Warning Information
Parameters Setting
Accuracy From Full Features and Reduced Features
T-test Results
Datasets Characteristic
Accuracy from [26] and MapReduce GA/kNN
General-purpose Machine Profiles
List of Clusters
Speedup and Efficiency

LIST OF FIGURES

1.1 Manufacturing Process
Genetic Algorithm Flowchart
Binary Encoding
Permutation Encoding
Value Encoding
N-Point Crossover
Uniform Crossover
Mutation
Nearest Neighbor
Time Series Plot
Modified Workflow
Alert Record
Formatted Batch Alert Record
Example of Encoded Chromosome
Pseudo-code of kNN Classifier
kNN Implementation on Hadoop
Plot of Execution Time and Instances

LIST OF EQUATIONS

3.1 Chromosome Encoding
Classification Accuracy
Fitness Function
Reduction Rate
Number of Instances

CHAPTER 1 INTRODUCTION

Knowledge and solutions to problems are often locked up in a flood of data. Real-world data, however, are usually complex and rarely systematic, so they cannot be exploited through direct, unsophisticated observation. Data analytics is the science of analyzing raw data with the aim of drawing appropriate conclusions and explanations from it. Data mining is the process of discovering knowledge by analyzing data with different techniques and from different perspectives, aiming to uncover patterns and trends. It is a method for turning data into knowledge, and it has contributed to various businesses for decades. Data mining techniques have been applied in industry since the 1990s and have progressed ever since in many different areas [1]. It has also become a topic of great interest to modern researchers.

Knowledge is the most valuable asset of any business, as it provides an opportunity for a business to differentiate itself from its competitors. Manufacturing data is widely recognized as a source of knowledge for detecting the causes of certain defects, which can lead to potential solutions for product quality control. Food production is one of the manufacturing processes in which quality control should be considered intensively. Collaborating with Betagro Limited, one of Thailand's largest food producers in agro-industry, engaged in fresh and frozen chicken and hygienic pork processing, we learned that a great amount of data is generated during the production process and stored in a huge data warehouse for the quality control process. In many cases, the production records contain a set of parameters that is too large to be monitored simultaneously. In addition, we found that Betagro's business model is make-to-order, meaning that the manufacturing schedule is not routine but varies according to the demands and sale orders received in each period. For instance, the same production line may produce different products each day.
Consequently, the production line should be monitored continuously, and real-time anomaly detection is desirable. The real-time process is expected to promote a prompt response, which results in effective error prevention: once a potentially defective pattern is detected, the system generates an alert and prompts the engineer to fix it immediately. Specifically, there are three stations in the production line: Preparation, Cooking and IQF, as illustrated in Figure 1.1. The process starts in Preparation, where the raw material (i.e. raw meat) is measured and marinated with sauces. Then, in the Cooking process, the material is cooked and cut into shape according to the order. Last is Individual Quick Freezing (IQF), which includes the packing and metal-detection processes. Throughout the manufacturing process, the quality control unit continuously records the environmental and production parameters to facilitate the monitoring system, namely, the anomaly detection system.

Figure 1.1 Manufacturing Process

In this work, the records from the production line were adopted as experimental data. The dataset consists of 51 parameters (for example: after-tumbling weight, raw material temperature and oil temperature), as presented in Appendix A. Statistical process control is a technique for examining the variation of production parameters. The current anomaly detection system was developed with a predefined set of parameters to be monitored; an individual parameter is flagged if its value is irregular. However, in many instances the alerts turned out to be redundant, which results in ineffective monitoring. For instance, there were over 10,000 alerts in a month. This makes it impossible for the engineers to review and fix every alert, which means there is a greater chance of important alerts going undetected. Thus, we propose a feature selection framework that automatically obtains the subset of informative parameters and identifies representatives of the removed parameters. The framework is expected to be a general-purpose tool, but it is also customized as an add-on module for the current anomaly detection system. It learns from past records of parameter values and recognizes patterns of anomaly. This information is used to categorize parameters and identify relationships between them. With this methodology, the number of parameters to be monitored in the future detection process can be minimized and the number of alerts reduced significantly; as a result, the efficiency of the alert process can be optimized. The learning process will be scheduled as appropriate, i.e. monthly or weekly. The selection process is designed as a pre-process; in reality, however, the production line changes daily and is non-deterministic. The updated set of representative parameters is expected to be ready before production starts, so the processing time must be minimized.
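To make the wrapper idea concrete, the selection loop (a GA proposing binary feature masks, with kNN accuracy as the fitness) can be sketched in a few dozen lines. This is an illustrative toy, not the thesis implementation: the GA operators used here (truncation selection, one-point crossover, bit-flip mutation) and all parameter values are assumptions chosen for brevity.

```python
import random
from collections import Counter

def knn_accuracy(train, test, mask, k=3):
    """Accuracy of a k-NN classifier restricted to the features
    switched on in the binary mask."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    correct = 0
    for x, label in test:
        # squared Euclidean distance over the selected features only
        dists = sorted((sum((x[i] - tx[i]) ** 2 for i in idx), tl)
                       for tx, tl in train)
        votes = Counter(tl for _, tl in dists[:k])
        if votes.most_common(1)[0][0] == label:
            correct += 1
    return correct / len(test)

def select_features(train, test, n_features, pop_size=20, gens=15, p_mut=0.05):
    """Wrapper-style GA: chromosomes are binary feature masks and the
    fitness of a mask is the kNN accuracy it achieves."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda m: knn_accuracy(train, test, m), reverse=True)
        parents = pop[:pop_size // 2]                 # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_features)     # one-point crossover
            child = [bit ^ (random.random() < p_mut)  # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: knn_accuracy(train, test, m))
```

With a toy dataset in which only the first two of five features separate the classes, the returned mask typically retains at least one informative feature, which is enough for perfect separation on such data.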
To allow the feature selection to complete in time for the start of the daily make-to-order manufacturing, fast computation is necessary to handle massively high-dimensional data. It is therefore beneficial to parallelize the designed technique. However, as such a large amount of data is generated in a single day, the size of the data is also a concern. We adopt MapReduce to accelerate the methodology and handle the large data size. MapReduce is a linearly scalable programming paradigm developed by Google in 2004, based on key-value pairs. It was designed to support data-intensive tasks that require processing the whole dataset in batch fashion [2]. The framework automatically parallelizes and executes the program on a cluster, and the runtime system handles the details of data partitioning, scheduling and failures. The emergence of this framework brings a great opportunity to empower existing algorithms with the ability to handle large populations. Many researchers have responded to the need to process huge volumes of data by scaling algorithms with the help of MapReduce; one of the most commonly used implementations is Hadoop.

Hadoop is a parallel and distributed programming framework implemented on the MapReduce paradigm. It consists of two parts: the Hadoop Distributed File System (HDFS) and a MapReduce engine. HDFS was introduced because big-data analysis raises not only the issue of computation time but also of how fast the data can be accessed. It underpins one remarkable characteristic of Hadoop: how the compute node accesses the data to be processed [3]. It also achieves reliability of the stored data by replicating files across multiple nodes. As for the MapReduce engine, the framework automatically parallelizes and executes the application on a cluster of machines while abstracting away the complexities of big-data analysis, such as data partitioning, data distribution, fault tolerance and load balancing. The engine was designed with a shared-nothing architecture; thus, the framework performs especially well for applications composed of independent tasks. Another remarkable benefit of Hadoop is the concept of data locality.
The scheduler of the framework automatically co-locates the computing node with the data node. In other words, it prioritizes the computing node that is close to the data node, which can significantly reduce the communication overhead and the traffic from data transfer between nodes. Closeness is measured by the physical location of the node: the same node is closest, followed by a node in the same rack, and so on. Another considerable benefit of Hadoop is that the framework was designed to run on commodity hardware; thus, it can be integrated into the current system by leveraging the existing hardware.

In summary, we have designed a feature selection framework to minimize the number of parameters to be monitored and, as a consequence, reduce the number of alerts. A subset of parameters will be formulated and representatives will be identified. Hadoop is integrated in two primary ways. The first is computation: the algorithm is parallelized across the compute nodes to minimize computing time. The second is storage: the production data are stored in HDFS to accelerate data access, since the process learns from the historical records. This study is expected to contribute to the big-data analysis community through the design of a parallel algorithm on the Hadoop platform. It also contributes to the food manufacturing industry, as we focused on practicality by customizing the framework to suit the food manufacturing process.
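As a rough illustration of how a fitness-evaluation step could be laid out as a Hadoop job, the sketch below follows the Hadoop Streaming convention of tab-separated key-value lines. It is a hypothetical sketch, not the thesis implementation; in particular, fitness_of is a placeholder that merely rewards smaller feature subsets, standing in for a kNN accuracy computed against training data read from HDFS.

```python
def fitness_of(bits):
    """Placeholder fitness: rewards smaller subsets. In a real job this
    would be the kNN accuracy, with training records shipped via HDFS."""
    return 1.0 - sum(bits) / len(bits)

def mapper(lines):
    """Streaming-style mapper: one candidate chromosome per input line,
    formatted as '<id>\\t<comma-separated 0/1 mask>'. Emits 'fitness\\tid'
    pairs; Hadoop Streaming would feed lines in on stdin."""
    for line in lines:
        cid, mask = line.rstrip("\n").split("\t")
        bits = [int(b) for b in mask.split(",")]
        yield "%.4f\t%s" % (fitness_of(bits), cid)

def reducer(pairs):
    """Streaming-style reducer: keeps the fittest chromosome so the
    driver can seed the next GA generation."""
    yield max(pairs, key=lambda kv: float(kv.split("\t")[0]))
```

In an actual deployment these two functions would be wired to stdin/stdout and launched with the Hadoop Streaming jar; here they operate on plain iterables so the data flow is easy to follow.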

For a better understanding: in this work, the term feature refers to a parameter of the manufacturing process. Thus, an abnormal feature is a parameter whose value lies outside the given boundary, while a normal feature is a parameter whose value is within the desired range.

The thesis is divided into six chapters. The literature survey is presented in Chapter 2, which discusses several feature selection techniques and relevant research; background and related theories are summarized at the end of that chapter. Chapter 3 illustrates the proposed framework, the parallel algorithm and the evaluation methods. Chapter 4 presents the experimental designs and results. Chapter 5 discusses the quality of the results. The last chapter concludes the entire work.

CHAPTER 2 LITERATURE REVIEW AND RELATED THEORIES

In recent years, different algorithms have been employed to identify the significance of features, reducing problems to lower dimensions by removing irrelevant attributes. The appropriate technique depends on the characteristics of the problem and the dataset. This chapter summarizes related research adopting different techniques aimed at reducing the dimension of the problem. In addition, some related theories and algorithms are included at the end of this chapter.

2.1 Literature Review

Feature Selection

Feature selection is a method of identifying the set of significant features in order to obtain the optimal subset, i.e. the one that most strongly affects the yield. The simplest method is to try every possibility, which is computationally intractable; it is therefore highly recommended to analyze the data carefully with appropriate techniques. There are three major categories of feature selection, Filter, Embedded and Wrapper techniques, summarized in Table 2.1.

Table 2.1 Feature Selection Categories

Technique | Processing time | Feature dependency | Limitation | Example algorithms
Filter | Fast | No | Ignores possible dependency between features | Statistical techniques
Embedded | Fast | Yes | Modification to the classification algorithm required | LDA, AUC and SVM
Wrapper | Slow | Yes | Computationally intensive | GA-DT [7], ABC-kNN [6]

The first is the Filter technique, composed of two individual tasks: selection and evaluation. One of the most widely used Filter techniques in manufacturing plants is to apply statistical methods. In [4], the authors applied a factorial-based method to screen manufacturing parameters. The proposed methodology was a combination of Principal Component Analysis (PCA) and factorial designs. PCA with the correlation matrix was performed to reduce the set of features. Then, they fit a regression model of the yield with the 2-level encoded parameters. Lastly, they performed the full factorial design on

the selected features to identify the significant interactions. They adopted manufacturing data of 5,000 instances, divided into Pass and Fail instances. The result showed that this method could identify 80% of the parameters previously selected by the experts, plus some additional parameters. They claimed that the proposed methodology handles a large number of parameters very well.

The second is the Embedded technique, which builds the selection process into a classifier. The third, the Wrapper technique, is the most general and widely used because it is simple to implement and broadly applicable. This approach wraps the selection process around a classification technique: once a subset is selected, the search process is guided by the performance of the classifier. In other words, the subset is scored by the classification accuracy calculated from the candidate subsets, which indicates whether the optimal solution has been reached. Since the classification must run iteratively, the task is computationally intensive and requires a relatively long execution time. However, with the advancement of high-performance computing, this issue is manageable.

In 2011, K. Chrysostomou et al. proposed a methodology to investigate the relationship between computer experience and users' preferences towards search engines [5]. They proposed a feature selection technique called the Wrapper-based Decision Tree (WDT). A wrapper feature selection approach was adopted to accurately select the set of features. The selection method obtained the mutual set among four classifiers, each having its own specific bias: Bayesian Network (BN), Decision Tree (DT), kNN and Support Vector Machines (SVM). Once the subset of features was obtained, they applied the DT algorithm to evaluate it. The subset that produced the tree with the highest accuracy was then selected.
The results showed great potential for the three-classifier combination (BN, DT and kNN), which produced the most informative set of features. It should be noted, however, that the article adopted only a small set of classifiers for feature selection; experimenting with a wider range of algorithms would be an interesting prospect.

Two of the most widely used and effective families of algorithms are Evolutionary Algorithms (EA) and Swarm Intelligence. The most popular EA is the Genetic Algorithm (GA), and the classical example of Swarm Intelligence is the Artificial Bee Colony (ABC). Both have been employed in a great deal of research, as discussed in the following paragraphs.

The article "Reducing bioinformatics data dimension with ABC-kNN" [6] proposed an algorithm to reduce the complexity of computation by obtaining a low-dimensional problem. The authors adopted ABC to reduce the dimension of the dataset by identifying the irrelevant features. This worked per iteration: once the employed and onlooker bees obtained a new food source, in other words a new subset of features, the kNN accuracy was computed as a fitness function to evaluate the subset. The proposed algorithm was applied to a dataset of 140 records with 32 features. The result yielded 85% classification accuracy with a set of eight selected features.

S. Geetha et al. proposed an evolving decision tree classifier (EDTC) [7] that integrated GA and DT for optimal quantitative feature selection and pattern classification. They aimed to determine whether a multimedia object in steganography had been modified. The goal was to minimize the number of features in each group, in other words, to obtain a low-dimensional problem. The dataset of 200 audio samples was classified as clean, benign or adulterated signals; each record had 25 features. The classification accuracy was evaluated by producing a decision tree with a set of features, with the accuracy computed as a fitness function. The results showed the great potential of GA over an existing method.

Another effective combination was proposed by M. Su et al. [8]. That research proposed a feature-weighting algorithm implemented on GA and combined with kNN, where the features of interest are of a quantitative type. GA created each chromosome as a subset of features, and kNN performed the evaluation by computing the accuracy rate, taken as the fitness value. The chromosome with the larger fitness value was promoted to be a parent. Two-point crossover was then applied to exchange genes between chromosomes, producing children. The mutation constraint was set such that a mutation yielding a fitness value at least 10% larger would be accepted; otherwise the mutation was ignored. The algorithm was tested with a dataset of 35 features, and the reported accuracy was achieved with the top 19 features ranked by the proposed methodology.

Table 2.2 compares GA and ABC along two dimensions, exploration and exploitation. Exploration is the capability to create population diversity by exploring the search space. In contrast, exploitation is the reduction of diversity by exploiting the neighborhood. The schema theorem [9] notes that the crossover operation is an efficient form of exploration because it preserves the good parts from different parents rather than relying on randomization alone; the combination of good portions offers better offspring. As a result, ABC tends to converge to a local optimum faster than GA. However, GA requires a longer computation time.
Table 2.2 GA and ABC

Algorithm | Exploration | Exploitation
Genetic Algorithm | The crossover operator breaks the link between parents, which leads the algorithm into unexplored search space | The mutation operator exploits the neighborhood by adjusting the current solution with a step size
Artificial Bee Colony | The scout bees explore new food sources by randomly selecting new solutions in the unexplored search space | The onlooker bees exploit the neighborhood by randomly adjusting the current solution

In addition, we aim to minimize the execution time by parallelizing the algorithm. As discussed by R. Arora et al. in [10], GA is well suited to parallelization because its tasks have low dependency. Specifically, the fitness evaluation and reproduction processes are independent tasks and have the Markov property. The Markov property defines a memoryless system: when the process moves to the next state, it depends only on the current state. For instance, reproduction can produce a new generation using only the information from the fitness evaluation process.
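This independence is what makes the parallelization safe: each fitness evaluation reads only its own chromosome, so the calls can be distributed verbatim. The toy sketch below uses a local process pool as a stand-in for MapReduce workers; the fitness function is a placeholder assumption, not the thesis's kNN evaluation.

```python
from multiprocessing import Pool

def fitness(chromosome):
    # Toy placeholder: each call touches only its own chromosome,
    # so evaluations share no state and can run on any worker.
    return sum(chromosome) / len(chromosome)

def rank_population(population):
    """Evaluate every chromosome in parallel, then rank. The next
    generation depends only on this ranked state (Markov property),
    not on any earlier generation."""
    with Pool() as pool:
        scores = pool.map(fitness, population)
    return [c for _, c in sorted(zip(scores, population), reverse=True)]
```

Swapping the process pool for MapReduce mappers changes only where the evaluations run, not the logic of the generation step.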

Classification Algorithms

As described earlier, the evaluation process of feature selection usually adopts classification techniques to score the subset. Selecting an appropriate technique depends on the purpose of the classification, and the characteristics of the data should be considered as well. This section discusses the usage of classification techniques with real-world data.

Classification is a supervised learning method aimed at prediction, learning from the patterns reflected by labeled groups of data. Current research particularly focuses on a few classifiers that have been shown to solve classification problems successfully in many domains. This section discusses four widely used algorithms: Decision Tree (DT), k-Nearest Neighbor (kNN), Support Vector Machine (SVM) and Naïve Bayes (NB).

Decision Tree is an algorithm that visualizes the relationships among parameters to create a predictive model. Many algorithms have been developed to construct a tree, such as AID, SERCH, CHAID, CARD, OC, ID3, C4.5, C5, QUEST and SAS [11]. The work in [12] presented an implementation of fault diagnosis with the C4.5 decision tree algorithm. C4.5 is a classifier with high accuracy and simple operation built on the DT algorithm. They characterized the pump state from vibration signals, which were summarized with nine statistical measures, such as standard error, standard deviation and variance; these statistical measures were treated as features. The classifier was trained with 150 samples and validated with 100 samples. The test dataset yielded 100% accuracy. They pointed out a critical issue for classification: the algorithm should not overfit the data during training, because the accuracy may suffer later. The work in [13] also applied the C4.5 decision tree classifier to detect failure patterns in disk drive manufacturing.
They adopted a dataset of 10,000 instances from the HGA production line for the experiment. The dataset was categorized into two classes, Pass and Fail. The decision tree was trained with 8,000 instances and tested with the other 2,000. They used the corrected gain ratio as the node-splitting criterion. The classification accuracy was 92.3%, and they noted that their method handled missing-value attributes very well.

kNN is an instance-based learning method that classifies a data point by measuring its similarity to the k nearest training instances. In [14], T. L. Priya et al. proposed speech identification classification with the kNN algorithm. They aimed to classify voices using data from speech-processing tools, divided into two groups: normal and disordered voice. With the traditional kNN, 80% classification accuracy was achieved. They observed that kNN is not only the simplest algorithm to implement but also an efficient one.

The work in [15] proposed an improved kNN algorithm for text classification based on the density of the data. It raised a concern about the algorithm's sensitivity to uneven distributions, which may significantly affect its accuracy. For instance, in a high-density area the similarity between objects will be much larger than in a low-density area, so the probability of selecting neighbors from the high-density area will be higher. The proposed kNN technique therefore takes the data distribution into account. The methodology balances the density and similarity factors: for example, when the density is low and the nearest neighbors are much sparser, the similarity should be amplified; in contrast, when the density is high and the neighbors are much denser, the similarity should be reduced. They validated the algorithm on a text classification application, compared it with the traditional methodology, and found that the classification accuracy was greatly improved and became more stable.

SVM is another well-known classification algorithm, generally used to build models for analyzing data and recognizing patterns. The concept of SVM is based on decision planes that divide the data points into classes while maximizing the margin between groups to achieve the best discrimination. The work in [16] applied SVM to predict turbine failures in the production environment of a thermal power plant. A dataset of 10,822 instances was collected from the power plant; each record consisted of 29 monitoring parameters, all of continuous type. The records were categorized into three groups: normal, low and abnormal. They used sampling with 70% of the data for training and the other 30% for testing. They compared SVM against Back-Propagation Neural Networks (BPN) and Linear Discriminant Analysis (LDA). The results showed that SVM achieved the highest accuracy at 93.13%, while BPN and LDA achieved only 85.57% and 75.95% respectively. They also noted that SVM handled the large amount of data very well. The work in [17] compared the performance of SVM and Naïve Bayes on text classification with data divided into two classes: interesting and uninteresting.
The algorithms were evaluated on three datasets (highly imbalanced, moderately imbalanced and balanced). The results showed that SVM outperformed Naïve Bayes on every dataset. They noted that while the sparseness of these datasets is crucial for Naïve Bayes, it does not affect the performance of SVM.

NB is a probabilistic classifier based on Bayes' theorem. It has proved to work well on many kinds of real-world data, especially high-dimensional data. In [18], Naïve Bayes was applied to classify terrain texture from a moving robot's data. The raw vibration-signal data were collected from the deflection of artificial whiskers attached to the robot. The data comprised four classes: two different carpets, tarmac and a vinyl surface. A sampling technique with a 0.5 sampling rate divided the dataset into training and testing data. The classifier accurately predicted the class with an 80% hit rate, which increased to 90% when more data were adopted; the authors discussed the importance of having enough data to train the model.

Table 2.3 shows the advantages and disadvantages of the four classifiers. The remarkable advantage of kNN over the other classifiers in the feature selection context is its sensitivity to irrelevant attributes. This characteristic can contribute significantly to the selection process, because kNN is likely to detect when the selection algorithm proposes a candidate containing irrelevant attributes. kNN is also robust to noise, which is crucial for real-world data since, in reality, the data are rarely well distributed or complete. In addition, kNN is claimed to be effective and appropriate for processing large datasets. Although the running time and performance of kNN are not comparable with those of other classifiers, the availability of computational power and high-performance platforms makes this issue less severe.

Table 2.3 Comparison of 4 Classifiers

Classifier | Advantages | Disadvantages
kNN | Robust to noise; sensitive to irrelevant attributes | Memory intensive; slow and poor runtime performance (esp. for large datasets)
Decision Tree | Good runtime performance; capable of handling both nominal and numeric attributes | Sensitive to noise
SVM | Robust to sparseness | Slow and poor runtime performance
Naïve Bayes | Good runtime performance | Sensitive to dependency between attributes

Big Data Analysis

The perfect combination of algorithms is not always the best answer, as a large amount of data beats a better algorithm [19]. With the dramatic increase in the amount of data, parallel and distributed processing have become popular practices, and it is of general interest to integrate algorithms with high-performance platforms. Cloud computing is an efficient paradigm for both compute- and data-intensive applications. MapReduce is widely adopted to maximize efficiency and parallelize algorithms on cloud infrastructure. Many articles have responded to the need to process huge volumes of data by scaling algorithms with MapReduce and its implementations; the most common is Hadoop, employed extensively by a number of organizations. The emergence of the framework brings a great opportunity to empower existing algorithms with the ability to handle large populations.

The article on efficient parallel kNN joins for large data in MapReduce [20] proposed a novel method extending kNN joins to process large amounts of data. A kNN join is a combination of joins and nearest-neighbor search; both are effective but costly operations. Thus, they parallelized the algorithm in MapReduce to improve the performance.
The article discussed the complexity of the straightforward method of integrating MapReduce with knn joins, the block nested loop. The worst case was quadratic in the dataset size, O(n²): the number of reducers required was quadratic, and the dataset needed to be duplicated into n² pieces. Hence, they leveraged a space-filling curve (z-value) to transform the knn join into a sequence of one-dimensional searches. Three pairs of mappers and reducers were employed during the process.
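The z-value transformation mentioned above can be sketched as bit interleaving. This minimal two-dimensional Morton-code example is an illustrative assumption; it omits the shifted data copies and the multiple mapper-reducer rounds used in the actual method of [20]:

```python
def z_value(x, y, bits=16):
    # Interleave the bits of x and y to produce a 1-D Morton code.
    # Points close together in 2-D space tend to receive nearby
    # z-values, so a knn search can be approximated by a 1-D range scan.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return z

points = [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
# Sorting by z-value yields a 1-D order that preserves much of the
# 2-D locality, which a reducer can then scan as a one-dimensional search.
print(sorted(points, key=lambda p: z_value(*p)))
```

Because the z-order only approximately preserves locality, the published method searches several shifted copies of the data to bound the error of the resulting knn join.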

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

A SURVEY ON GENETIC ALGORITHM FOR INTRUSION DETECTION SYSTEM

A SURVEY ON GENETIC ALGORITHM FOR INTRUSION DETECTION SYSTEM A SURVEY ON GENETIC ALGORITHM FOR INTRUSION DETECTION SYSTEM MS. DIMPI K PATEL Department of Computer Science and Engineering, Hasmukh Goswami college of Engineering, Ahmedabad, Gujarat ABSTRACT The Internet

More information

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Journal of Al-Nahrain University Vol.15 (2), June, 2012, pp.161-168 Science Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Manal F. Younis Computer Department, College

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

A Robust Method for Solving Transcendental Equations

A Robust Method for Solving Transcendental Equations www.ijcsi.org 413 A Robust Method for Solving Transcendental Equations Md. Golam Moazzam, Amita Chakraborty and Md. Al-Amin Bhuiyan Department of Computer Science and Engineering, Jahangirnagar University,

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Genetic Algorithm. Based on Darwinian Paradigm. Intrinsically a robust search and optimization mechanism. Conceptual Algorithm

Genetic Algorithm. Based on Darwinian Paradigm. Intrinsically a robust search and optimization mechanism. Conceptual Algorithm 24 Genetic Algorithm Based on Darwinian Paradigm Reproduction Competition Survive Selection Intrinsically a robust search and optimization mechanism Slide -47 - Conceptual Algorithm Slide -48 - 25 Genetic

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Alpha Cut based Novel Selection for Genetic Algorithm

Alpha Cut based Novel Selection for Genetic Algorithm Alpha Cut based Novel for Genetic Algorithm Rakesh Kumar Professor Girdhar Gopal Research Scholar Rajesh Kumar Assistant Professor ABSTRACT Genetic algorithm (GA) has several genetic operators that can

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

Neural Network and Genetic Algorithm Based Trading Systems. Donn S. Fishbein, MD, PhD Neuroquant.com

Neural Network and Genetic Algorithm Based Trading Systems. Donn S. Fishbein, MD, PhD Neuroquant.com Neural Network and Genetic Algorithm Based Trading Systems Donn S. Fishbein, MD, PhD Neuroquant.com Consider the challenge of constructing a financial market trading system using commonly available technical

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry Hadoop Operations Management for Big Data Clusters in Telecommunication Industry N. Kamalraj Asst. Prof., Department of Computer Technology Dr. SNS Rajalakshmi College of Arts and Science Coimbatore-49

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Evolutionary SAT Solver (ESS)

Evolutionary SAT Solver (ESS) Ninth LACCEI Latin American and Caribbean Conference (LACCEI 2011), Engineering for a Smart Planet, Innovation, Information Technology and Computational Tools for Sustainable Development, August 3-5, 2011,

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

TOWARD BIG DATA ANALYSIS WORKSHOP

TOWARD BIG DATA ANALYSIS WORKSHOP TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

More information

D A T A M I N I N G C L A S S I F I C A T I O N

D A T A M I N I N G C L A S S I F I C A T I O N D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

BIG DATA IN HEALTHCARE THE NEXT FRONTIER

BIG DATA IN HEALTHCARE THE NEXT FRONTIER BIG DATA IN HEALTHCARE THE NEXT FRONTIER Divyaa Krishna Sonnad 1, Dr. Jharna Majumdar 2 2 Dean R&D, Prof. and Head, 1,2 Dept of CSE (PG), Nitte Meenakshi Institute of Technology Abstract: The world of

More information

Genetic Algorithms commonly used selection, replacement, and variation operators Fernando Lobo University of Algarve

Genetic Algorithms commonly used selection, replacement, and variation operators Fernando Lobo University of Algarve Genetic Algorithms commonly used selection, replacement, and variation operators Fernando Lobo University of Algarve Outline Selection methods Replacement methods Variation operators Selection Methods

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Pentaho Data Mining Last Modified on January 22, 2007

Pentaho Data Mining Last Modified on January 22, 2007 Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

More information

Genetic Algorithms and Sudoku

Genetic Algorithms and Sudoku Genetic Algorithms and Sudoku Dr. John M. Weiss Department of Mathematics and Computer Science South Dakota School of Mines and Technology (SDSM&T) Rapid City, SD 57701-3995 john.weiss@sdsmt.edu MICS 2009

More information

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

lop Building Machine Learning Systems with Python en source

lop Building Machine Learning Systems with Python en source Building Machine Learning Systems with Python Master the art of machine learning with Python and build effective machine learning systems with this intensive handson guide Willi Richert Luis Pedro Coelho

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

Data Mining. Shahram Hassas Math 382 Professor: Shapiro

Data Mining. Shahram Hassas Math 382 Professor: Shapiro Data Mining Shahram Hassas Math 382 Professor: Shapiro Agenda Introduction Major Elements Steps/ Processes Examples Tools used for data mining Advantages and Disadvantages What is Data Mining? Described

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

Map-Reduce for Machine Learning on Multicore

Chu et al. Problem: the world is going multicore; new computers range from dual-core to 12+-core, driving a shift to more concurrent programming paradigms and languages (Erlang, …

Data Algorithms. Mahmoud Parsian. O'Reilly: Beijing, Boston, Farnham, Sebastopol, Tokyo

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol Data Algorithms Mahmoud Parsian Beijing Boston Farnham Sebastopol Tokyo O'REILLY Table of Contents Foreword xix Preface xxi 1. Secondary Sort: Introduction 1 Solutions to the Secondary Sort Problem 3 Implementation

Advanced In-Database Analytics

Tallinn, Sept. 25th, 2012. Mikko-Pekka Bertling, BDM, Greenplum EMEA. That sounds complicated? Who can tell me how best to solve this? What are the main mathematical functions? …

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-1, Issue-6, January 2013 …

Data Mining Techniques for Prognosis in Pancreatic Cancer

by Stuart Floyd. A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE, in partial fulfillment of the requirements for the Degree …

Scalable Developments for Big Data Analytics in Remote Sensing

Federated Systems and Data Division, Research Group High Productivity Data Processing. Dr.-Ing. Morris Riedel et al., Research Group Leader, …

Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method. Abstract: Predictive modeling has become an important area of interest in tasks such as credit …

Neural Networks and Back Propagation Algorithm

Mirza Cilimkovic, Institute of Technology Blanchardstown, Blanchardstown Road North, Dublin 15, Ireland, mirzac@gmail.com. Abstract: Neural Networks (NN) are important …
