Distributed Data Mining Algorithm Parallelization


B.Tech Project Report

By: Rishi Kumar Singh (Y6389), Abhishek Ranjan (10030)
Project Guide: Prof. Satyadev Nandakumar
Department of Computer Science and Engineering, IIT Kanpur

Acknowledgements

We would like to thank our project guide, Prof. Satyadev Nandakumar, for guiding us throughout the project and giving us the necessary advice and instructions. He was truly supportive and helped us whenever we were in need. We thank him for setting aside precious time from his busy schedule, without which it would have been very hard for us to make progress on the project.

Abstract

Data mining is the processing of large amounts of data to extract useful information from it. The goal of this project is to use Erlang and distributed systems to parallelize data mining algorithms and simplify this work. To do so, we take an Association Rule Mining algorithm, Apriori, formulate its distributed version, and implement it in Erlang. We then test the code by running it in parallel on multiple machines, with several nodes set up on each, communicating through Erlang message passing. The algorithm can be coded in Erlang in a very small number of lines, which encourages us to use Erlang to parallelize many other similar data mining algorithms.

Introduction

Data Mining and Association Rule Mining

Data Mining is the computational process of discovering useful patterns in large data sets. These patterns can then be transformed into an understandable form for analysis or other further uses. Association Rule Mining is a type of data mining in which specific relations between the items of large data sets are found, and strong rules are formulated connecting one set of items to another. This method is widely used in supermarkets, which stock a large number of items and record a large amount of transaction data from many customers. Mining algorithms can be used in this setting to find which item sets are likely to be bought together with which other item sets.

Erlang and Parallelization

Data mining algorithms are very important in the present world and are widely used. The main problem associated with them is the size of the data sets on which they must run. The solution is parallelization of the algorithms, which can drastically improve their efficiency in terms of time and memory. We use Erlang, a dynamically typed functional language, to parallelize these algorithms. Erlang supports lightweight processes and a message-passing model of concurrency, which is what makes it suitable for this purpose.

Erlang Programming Language

Efficiency in the Erlang programming language is achieved by two methods: concurrent programming and distributed programming. Concurrent programming makes an algorithm time efficient, and distributed programming lets an algorithm use the maximum of the resources at hand across all the connected systems. In Erlang, concurrency is achieved with processes and message passing, and distribution is achieved by creating Erlang nodes.

Erlang Processes

Erlang processes are lightweight, and concurrency is achieved by message passing between processes. Erlang has a built-in function spawn for creating a process:

    spawn(Module_Name, Function_Name, Argument_List)

We can create a process on another node by passing one extra argument to spawn:

    spawn(Node_Name, Module_Name, Function_Name, Argument_List)

spawn returns the process identifier (pid) of the created process. Pids are used to communicate with a process.

Message Passing in Erlang

Communication between processes in Erlang is done by message passing. The ! operator is used for sending messages:

    Receiver_Pid ! Message

Erlang uses an asynchronous message-passing mechanism, meaning the sender does not wait for an acknowledgement from the receiver. Every process has a mailbox in which received messages are stored, and pattern matching is used to select messages from the mailbox:

    receive
        Pattern1 -> Task1;
        ...
        PatternN -> TaskN
    end

We can also register a process under a name and use that name instead of the pid for communication.
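The spawn-and-mailbox model described above can be mimicked outside Erlang. The following is a minimal, hypothetical Python sketch, for illustration only: threads stand in for Erlang's lightweight processes, a queue.Queue for the per-process mailbox, and a send helper for the ! operator.

```python
import queue
import threading

def spawn(fun, *args):
    """Mimic Erlang's spawn: start a 'process' and return a handle to it.
    The returned mailbox stands in for the pid."""
    mailbox = queue.Queue()  # every process has its own mailbox
    t = threading.Thread(target=fun, args=(mailbox,) + args, daemon=True)
    t.start()
    return mailbox

def send(mailbox, message):
    """Mimic Pid ! Message: asynchronous, the sender does not wait."""
    mailbox.put(message)

def echo_process(mailbox, reply_to):
    """A toy process that receives messages and echoes them back."""
    while True:
        msg = mailbox.get()  # 'receive': block until a message arrives
        if msg == "stop":
            break
        send(reply_to, ("echo", msg))

# usage
inbox = queue.Queue()
pid = spawn(echo_process, inbox)
send(pid, "hello")
print(inbox.get())  # ('echo', 'hello')
send(pid, "stop")
```

Unlike Erlang's receive, this sketch takes messages strictly in arrival order instead of selecting them from the mailbox by pattern matching.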

Erlang Nodes

Distributed programming in Erlang is done using Erlang nodes, on which Erlang processes run. Nodes can be created on the same host or on different hosts. The syntax is

    erl -name node_name@host_name -setcookie cookie_name

where host_name is the IP address or name of the machine. Cookies handle the security issues: any node that wants to connect to a particular node must use the same cookie. For creating nodes on the same machine, we can use a simpler version of the above syntax:

    erl -sname node_name

Security is not a problem here because the nodes are created on the same machine.

Association Rules

Association rule mining is used to capture rules that explain the presence of some set of data items given the presence of some other set of data items. Association rules are of the form

    {Ia1, Ia2, ..., Ian} -> {Ib1, Ib2, ..., Ibm}

which says that whenever the set of items on the left-hand side occurs in a transaction, the itemset on the right-hand side also occurs. It is highly unlikely that such an association between two sets of items holds for every transaction, so we use thresholds to find strong association rules that hold in at least a minimum number of transactions. For this purpose we define two measures, support and confidence, each with its own threshold. The support of a set of items is the ratio of the number of transactions in which the set occurs to the total number of transactions. The confidence of a rule A -> B is the ratio of the number of occurrences of B whenever A occurs to the total number of occurrences of A. With these measures, a frequent itemset is an itemset whose support is greater than or equal to the support threshold, and a strong association rule is a rule whose confidence is greater than or equal to the confidence threshold. Finally, the mining of association rules is done in two steps:

1. Finding frequent itemsets
2. Generating strong association rules
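As a concrete illustration of the support and confidence definitions, here is a small Python sketch (Python is used only for illustration; the item names and transactions are made-up market-basket data):

```python
def support(itemset, transactions):
    """Support: fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs u rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# hypothetical market-basket transactions
T = [{"bread", "milk"},
     {"bread", "butter"},
     {"bread", "milk", "butter"},
     {"milk"}]

print(support({"bread"}, T))               # 0.75
print(confidence({"bread"}, {"milk"}, T))  # ~0.667: milk occurs in 2 of the 3 bread transactions
```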

The step of finding frequent itemsets is heavy in terms of time consumption, as it involves scanning large data sets. Various algorithms exist for finding frequent itemsets; one of the basic and efficient ones is the Apriori algorithm, which is the algorithm we implement concurrently in this project.

Apriori Algorithm

Apriori is an algorithm to find frequent itemsets from a given set of transactions in a large data set. Its approach is to find the frequency of sets of items and then prune the list using the given threshold value. It finds the frequency of all 1-item sets and prunes them, then all 2-item sets and prunes them, and continues until no further n-item sets can be formed or the size of the item sets reaches the total number of distinct items in the data set. To form the (k+1)-item sets, it uses the frequent k-item sets, exploiting the fact that all subsets of a frequent itemset are also frequent. Finally, all the frequent n-itemsets are combined to form the set of all frequent itemsets, which is the output of the algorithm.

Conventional Apriori Algorithm

The Apriori algorithm follows a candidate-generation-and-test paradigm. Its working principle is: if an itemset is frequent, then all its subsets are also frequent. Equivalently: if an itemset is not frequent, then all of its supersets are also not frequent. The algorithm generates candidate itemsets in increasing order of length. It prunes the candidate itemsets using the given support threshold to form the frequent itemsets, then uses all frequent itemsets of a particular length to form candidate itemsets one item longer. The algorithm terminates when there are no more candidate itemsets or the length of the candidate itemsets reaches the total number of items in the data set.

Stepwise Algorithm

Initialization
  o 1-Candidate <- generate the 1-item candidate set by finding the frequency of each item
  o call Gen_freq_set(1-Candidate) and store the result as 1-Frequent

Gen_freq_set(k-candidate)
  o if k-candidate is empty or k equals the number of items in the data set -> terminate
  o k-frequent <- prune k-candidate using the support threshold
  o store the k-frequent set
  o call Gen_cand_set(k-frequent)

Gen_cand_set(k-frequent)
  o (k+1)-candidate <- merge pairs of k-frequent itemsets whose k-1 items are common
  o create the size-k subsets of each (k+1)-candidate and check that they are frequent; if so, keep the (k+1)-itemset in the candidate set, otherwise discard it
  o call Gen_freq_set((k+1)-candidate)

Finally, combine all the frequent sets and return them.

Parallelized Apriori Algorithm

To parallelize the above algorithm and run it on distributed systems, we use an approach that divides the large data set into smaller chunks and processes them with a multithreaded parallel algorithm. The process of dividing the data set and combining the results is repeated when finding each k-item frequent set. For this purpose, many nodes are created on different systems, each of which generates candidate itemsets for the part of the data set given to it. There is also a main node, which communicates with all its child nodes through message passing. The main node divides the large data set into smaller chunks and sends them to the computing nodes. It also collects the candidate itemsets found by all the computing nodes, combines them, and performs pruning to form the frequent itemsets. It then sends this frequent itemset back to each of the computing nodes to form candidate itemsets of size one higher. The process finishes with the termination of the main node, which terminates after outputting the final frequent itemsets.
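Before parallelization, the sequential candidate-generation-and-test loop described above can be put into executable form. The following compact Python sketch is an illustration only: the report's actual implementation is in Erlang, and min_support is taken here as an absolute transaction count rather than the ratio defined earlier.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal sequential Apriori sketch.
    min_support is an absolute count (the report defines support as a ratio)."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidate):
        # support count: number of transactions containing the candidate
        return sum(1 for t in transactions if candidate <= t)

    frequent = {}
    # 1-item candidates: one singleton per distinct item
    level = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while level:
        # prune with the support threshold (Gen_freq_set)
        level = {c for c in level if count(c) >= min_support}
        frequent.update({c: count(c) for c in level})
        # join: merge two frequent k-itemsets sharing k-1 items (Gen_cand_set)
        level = {a | b for a in level for b in level if len(a | b) == k + 1}
        # downward closure: every size-k subset must already be frequent
        level = {c for c in level
                 if all(frozenset(s) in frequent for s in combinations(c, k))}
        k += 1
    return frequent
```

Calling apriori([{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}, {"a","b","c"}], 3) returns the six itemsets of size one and two with their counts; {a,b,c} occurs only twice and is pruned.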

Stepwise Algorithm

Create the main node and N computing nodes.

Task of the Main Node
  o Generation of the 1-item frequent set
    - Divide the data set into N parts and send one part to each computing node
    - Wait to receive the 1-item candidate sets as messages from each computing node
    - Combine all the small candidate sets to form a new candidate set -> 1-Candidate
    - Call Gen_freq_set(1-Candidate)
    - Send the 1-Frequent set as a message to each computing node for finding the small 2-item candidate sets
  o Gen_freq_set(k-candidate)
    - If k-candidate is empty or k equals the number of items in the data set -> terminate
    - k-frequent <- prune k-candidate using the support threshold
    - Store and return the k-frequent set
  o Collect_cand_set()
    - Wait to receive the k-item candidate sets as messages from each computing node
    - Combine all the small candidate sets to form a new candidate set -> k-candidate
    - Call Gen_freq_set(k-candidate)
    - If Gen_freq_set terminates: combine all the frequent itemsets and return them, send terminate as a message to each computing node, and terminate
    - Else: send the k-frequent set as a message to each computing node for finding the small (k+1)-item candidate sets, and call Collect_cand_set()

Task of Each Computing Node
  o Generation of the 1-item candidate set
    - Wait for the small data set as a message from the main node
    - Count the occurrences of each item in the data set and store them as the 1-item candidate set
    - Send the 1-item candidate set to the main node for combining
  o Gen_cand_set()
    - Wait for a message from the main node; if it is terminate, terminate this node
    - Otherwise the message is the k-frequent set:
    - (k+1)-candidate <- merge pairs of k-frequent itemsets whose k-1 items are common
    - Create the size-k subsets of each (k+1)-candidate and check that they are frequent; if so, keep the (k+1)-itemset in the candidate set, otherwise discard it
    - Send the (k+1)-candidate set as a message to the main node for combining
    - Call Gen_cand_set()

We have implemented the first part of the algorithm, namely the generation of the 1-item frequent set. With this, the main frame of the algorithm is coded. The second part will require making the already coded processes run recursively.
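The divide/count/combine/prune cycle that the main node and computing nodes perform for the 1-item frequent set (the part implemented so far) can be sketched in Python, with a thread pool standing in for the Erlang computing nodes. This is an illustration under those assumptions, not the report's code; names like parallel_1item_frequent are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    """Computing node: count item occurrences in its part of the data set."""
    counts = Counter()
    for transaction in chunk:
        counts.update(transaction)
    return counts

def parallel_1item_frequent(transactions, n_nodes, min_support):
    """Main node: divide the data, collect partial counts, combine, prune."""
    # divide the data set into n_nodes roughly equal parts
    chunks = [transactions[i::n_nodes] for i in range(n_nodes)]
    # each 'node' counts its own chunk (threads stand in for Erlang processes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # combine the small candidate sets into the 1-item candidate set
    total = Counter()
    for partial in partials:
        total += partial
    # prune with the support threshold to obtain the 1-item frequent set
    return {item: n for item, n in total.items() if n >= min_support}

# usage: 5 transactions split across 3 'nodes'
T = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}, {"a"}]
print(parallel_1item_frequent(T, 3, min_support=3))  # {'a': 4}
```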

Observation

We created a varying number of nodes on one system, ran a particular data set (number of transactions = 100,000) in each case, and recorded the time taken to find the 1-item frequent set. The readings are shown below.

No. of Nodes    Execution Time (ms)
 1              1208851
 5               179889
10                87782
20                38480
30                22950
40                16503
50                19124
60                29217
70                30060

The table shows that the execution time decreases as the number of nodes increases. But beyond a certain point, the execution time starts increasing with the number of nodes. This happens because we are creating all the nodes on the same system, so node creation itself takes time. Also, more nodes means dividing the data into more parts and collecting results from more parts before proceeding, which also increases the overall execution time considerably.

Conclusion

From the data in the above table, we can say that distributed programming is a very powerful feature of Erlang which can be used to significantly improve the efficiency of data mining algorithms that process large data sets. Erlang is a very simple functional language with a small number of concepts, yet it can be used to code very complex algorithms in very few lines compared to other programming languages. Apart from data mining algorithms, it can also be used to write software that exploits its concurrency features to improve performance.

References:

1. http://www.erlang.org
2. http://web.cse.iitk.ac.in/users/cs685/lectures.html
3. Zhi-gang Wang and Chi-She Wang. "A Parallel Association Rule Mining Algorithm." In: Web Information Systems and Mining, Lecture Notes in Computer Science, Volume 7529, 2012, pp. 125-129.
4. Joe Armstrong, Robert Virding, Claes Wikstrom, and Mike Williams. Concurrent Programming in Erlang.
5. Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms.