Distributed Data Mining Algorithm Parallelization
B.Tech Project Report
By: Rishi Kumar Singh (Y6389), Abhishek Ranjan (10030)
Project Guide: Prof. Satyadev Nandakumar
Department of Computer Science and Engineering, IIT Kanpur
Acknowledgements
We would like to thank our project guide, Prof. Satyadev Nandakumar, for guiding us throughout the project and giving us the necessary advice and instructions. He has been truly supportive and helped us whenever we were in need. We thank him for sparing precious time from his busy schedule, without which it would have been very hard for us to make progress on the project.
Abstract
Data mining is the processing of large amounts of data to extract useful information from it. The goal of this project is to use Erlang to parallelize data mining and make the work simpler using distributed systems. To do so, we take an Association Rule Mining algorithm, Apriori, formulate its distributed version, and implement it in Erlang. We then test the code by running it on multiple machines, with several nodes set up on each, in parallel using Erlang's message passing. The algorithm can be coded in Erlang in a very small number of lines. Seeing this, we are encouraged to use Erlang to parallelize many other similar data mining algorithms.

Introduction

Data Mining and Association Rule Mining
Data Mining is the computational process of discovering useful patterns in large data sets. These patterns can then be transformed into an understandable form for analysis or other further uses. Association Rule Mining is a type of Data Mining in which specific relations between the items of large data sets are found, and strong rules are formulated connecting one set of items to another. This method is widely used in supermarkets, which stock a large number of items and record large volumes of customer transactions. Mining algorithms can be used in this case to find which itemsets are likely to be bought together with other itemsets.

Erlang and Parallelization
Data mining algorithms are very important in the present world and are widely used. The main problem associated with them is the size of the data sets on which the algorithms must run. The solution to this is parallelization of the algorithms, which can drastically improve their efficiency in terms of time and memory. We use Erlang, a dynamically typed functional language, to parallelize these algorithms. Erlang supports lightweight processes and a message-passing model of concurrency, which is what makes it suitable for the purpose.
Erlang Programming Language
Efficiency in the Erlang programming language is achieved by two methods: Concurrent Programming and Distributed Programming. Concurrent programming makes the algorithm time-efficient, and distributed programming lets the algorithm make maximum use of the resources available across all connected systems. In Erlang, concurrency is achieved by processes and message passing, and distribution is achieved by creating Erlang nodes.

Erlang Processes
Concurrency in Erlang is achieved by message passing between processes, which are lightweight. There is a built-in function, spawn, defined in Erlang for creating a process:
spawn(Module_Name, Function_Name, Argument_List)
We can create a process on another node by using one extra argument in the spawn function:
spawn(Node_Name, Module_Name, Function_Name, Argument_List)
spawn returns the process identifier (pid) of the created process. Pids are used for communication with a process.

Message Passing in Erlang
Communication between processes in Erlang is done by message passing:
Receiver_Pid ! Message
! is the operator used for sending messages. Erlang uses an asynchronous message-passing mechanism, meaning the sender does not wait for an acknowledgment from the receiver. Every process has a mailbox in which received messages are stored, and pattern matching is used to select messages from the mailbox:
receive
    Pattern1 -> Task1;
    ...
    PatternN -> TaskN
end
We can also register a process under a name and use that name instead of the pid for communication.
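As a minimal sketch of the primitives described above (the module and function names here are our own illustration, not part of the project code), a process can be spawned, sent a message, and replied to like this:

```erlang
-module(echo).
-export([start/0, loop/0]).

%% Spawn an echo process and exchange one message with it.
start() ->
    Pid = spawn(echo, loop, []),    % create a lightweight process
    Pid ! {self(), hello},          % asynchronous send; no waiting
    receive                         % pattern-match on our own mailbox
        {Pid, Reply} -> Reply
    end.

%% The spawned process: receive one message and echo it back.
loop() ->
    receive
        {From, Msg} -> From ! {self(), Msg}
    end.
```

Calling echo:start() returns the atom hello once the round trip completes. The same spawn call with a node name as the first argument would place loop/0 on a remote node without any other change to the code.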
Erlang Nodes
Distributed programming in Erlang is done using Erlang nodes, on which Erlang processes run. We can create nodes on the same host or on different hosts. The syntax for this is:
erl -name node_name@host_name -setcookie cookie_name
host_name is the IP address or name of the machine. Cookies are used to handle security: if any node wants to connect to a particular node, it must use the same cookie. A simpler version of the above syntax can be used for creating nodes on the same machine:
erl -sname node_name
Security is not a problem here because the nodes are created on the same machine.

Association Rules
Association rule mining is used to capture all possible rules that can explain the presence of one set of data items given the presence of another set. Association rules are of the form:
{Ia1, Ia2, ..., Ian} -> {Ib1, Ib2, ..., Ibm}
This says that whenever the set of items on the left-hand side occurs in a transaction, the itemset on the right-hand side also occurs. It is highly unlikely to find such an association holding for every transaction, so we use thresholds to find strong association rules that hold in at least a minimum number of transactions. For this purpose we define two measures, support and confidence, each with its own threshold. The support of a set of items is the ratio of the number of transactions containing it to the total number of transactions. The confidence of a rule A -> B is the ratio of the number of transactions containing both A and B to the number of transactions containing A. With these terms, a frequent itemset is an itemset whose support is greater than or equal to the support threshold, and a strong association rule is an association rule whose confidence is greater than or equal to the confidence threshold. Finally, the mining of association rules is done in two steps:
o Finding frequent itemsets
o Generating strong association rules
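The two measures above can be computed directly from a list of transactions. The following is a small illustrative sketch (module and function names are ours); itemsets and transactions are plain lists of item atoms:

```erlang
-module(measures).
-export([support/2, confidence/3]).

%% Support(Itemset) = (#transactions containing Itemset) / (#transactions)
support(Itemset, Transactions) ->
    Count = length([T || T <- Transactions,
                         lists:all(fun(I) -> lists:member(I, T) end, Itemset)]),
    Count / length(Transactions).

%% Confidence(A -> B) = Support(A union B) / Support(A)
confidence(A, B, Transactions) ->
    support(A ++ B, Transactions) / support(A, Transactions).
```

For example, over the transactions [[bread,milk], [bread,butter], [milk,butter], [bread,milk,butter]], support([bread,milk], ...) is 2/4 = 0.5, and confidence([bread], [milk], ...) is 0.5/0.75, i.e. two of the three bread transactions also contain milk.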
The step of finding frequent itemsets is heavy in terms of time consumption, as it involves scanning large data sets. Various algorithms exist for finding frequent itemsets; one of the basic and efficient ones is the Apriori algorithm, which is the algorithm we implement concurrently in this project.

Apriori Algorithm
Apriori is an algorithm that finds frequent itemsets from a given set of transactions in a large data set. Its approach is to find the frequency of sets of items and then prune the list using the given threshold value. It finds the frequencies of all 1-itemsets and prunes them, then the 2-itemsets, and so on, until no further n-itemsets can be formed or the size of the itemsets reaches the total number of distinct items in the data set. To form the (k+1)-itemsets, it uses the frequent k-itemsets, relying on the fact that all subsets of a frequent itemset are also frequent. Finally, all the frequent n-itemsets are combined to form the set of all frequent itemsets, which is the output of the algorithm.

Conventional Apriori Algorithm
The Apriori algorithm works on a candidate-generation-and-test paradigm. Its working principle is: if an itemset is frequent, then all of its subsets are also frequent. This can also be stated as: if an itemset is not frequent, then none of its supersets are frequent. The algorithm generates candidate itemsets in increasing order of length. It prunes each candidate set using the given support threshold to form the frequent itemsets, then uses all frequent itemsets of a particular length to form candidate itemsets one item longer. The algorithm terminates when there are no more candidate itemsets or the length of the candidate itemsets reaches the total number of items in the data set.

Stepwise Algorithm
Initialization
o 1-Candidate <- generate the 1-item candidate set by counting the frequency of each item
o Call Gen_freq_set(1-Candidate) and store the result as 1-Frequent
Gen_freq_set(k-Candidate)
o If k-Candidate is empty or k equals the number of items in the data set -> terminate
o k-Frequent <- prune k-Candidate using the support threshold
o Store the k-Frequent set
o Call Gen_cand_set(k-Frequent)
Gen_cand_set(k-Frequent)
o (k+1)-Candidate <- join pairs of frequent k-itemsets whose first k-1 items are the same
o Create the size-k subsets of each (k+1)-candidate and check that they are all frequent; if so, keep the (k+1)-itemset in the candidate set, otherwise discard it
o Call Gen_freq_set((k+1)-Candidate)
Combine all the frequent sets and return them.

Parallelized Apriori Algorithm
To parallelize the above algorithm and run it on distributed systems, we use an approach that divides the large data set into smaller chunks and processes them with a parallel, message-passing algorithm. The process of dividing the data set and combining the results is repeated for finding each k-item frequent set. For this purpose, many nodes are created on different systems; each carries out the operation of generating candidate itemsets for the part of the data set given to it. There is also a main node, which communicates with all its child nodes through message passing. The main node divides the large data set into smaller chunks and sends them to the computing nodes. It also collects the candidate itemsets found by all the computing nodes, combines them, and performs pruning to form the frequent itemsets. It then sends this frequent itemset back to each of the computing nodes to form candidate itemsets of the next size. The process terminates with the termination of the main node, which terminates after outputting the final frequent itemsets.
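The prune and join steps of the sequential algorithm above can be sketched as follows. This is a simplified illustration with our own module and function names; it assumes itemsets are kept as sorted lists, and it omits the subset-frequency check on generated candidates:

```erlang
-module(apriori).
-export([prune/3, gen_candidates/1]).

%% Keep only candidate itemsets whose support count meets the threshold.
prune(Candidates, Transactions, MinCount) ->
    [C || C <- Candidates, count(C, Transactions) >= MinCount].

%% Number of transactions containing every item of Itemset.
count(Itemset, Transactions) ->
    length([T || T <- Transactions,
                 lists:all(fun(I) -> lists:member(I, T) end, Itemset)]).

%% Join pairs of frequent k-itemsets sharing their first k-1 items
%% to form (k+1)-item candidates.
gen_candidates(Frequent) ->
    lists:usort([lists:usort(A ++ B)
                 || A <- Frequent, B <- Frequent, A < B,
                    lists:sublist(A, length(A) - 1) =:=
                        lists:sublist(B, length(B) - 1)]).
```

For 1-itemsets the shared prefix is empty, so gen_candidates([[a],[b],[c]]) yields all three 2-item candidates [[a,b],[a,c],[b,c]], which prune/3 then filters against the transactions.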
Stepwise Algorithm
Create a main node and N computing nodes.
Task of the Main Node
o Generation of the 1-item frequent set
  - Divide the data set into N parts and send one to each computing node
  - Wait to receive the 1-item candidate sets as messages from each computing node
  - Combine all the small candidate sets to form a new candidate set -> 1-Candidate
  - Call Gen_freq_set(1-Candidate)
  - Send the 1-Frequent set as a message to each computing node for finding the small 2-item candidate sets
o Gen_freq_set(k-Candidate)
  - If k-Candidate is empty or k equals the number of items in the data set -> terminate
  - k-Frequent <- prune k-Candidate using the support threshold
  - Store and return the k-Frequent set
o Collect_cand_set()
  - Wait to receive the k-item candidate sets as messages from each computing node
  - Combine all the small candidate sets to form a new candidate set -> k-Candidate
  - Call Gen_freq_set(k-Candidate)
  - If Gen_freq_set terminates:
    combine all the frequent itemsets and return them,
    send terminate as a message to each computing node,
    and terminate
  - Else send the k-Frequent set as a message to each computing node for finding the small (k+1)-item candidate sets
  - Call Collect_cand_set()
Task of Each Computing Node
o Generation of the 1-item candidate set
  - Wait for the small data set as a message from the main node
  - Count the occurrences of each item in the data set and store them as the 1-item candidate set
  - Send the 1-item candidate set to the main node for combining
o Gen_cand_set()
  - On receiving terminate as a message from the main node, terminate this node
  - Otherwise, wait for the k-Frequent set as a message from the main node
  - (k+1)-Candidate <- join pairs of frequent k-itemsets whose first k-1 items are the same
  - Create the size-k subsets of each (k+1)-candidate and check that they are all frequent; if so, keep the (k+1)-itemset in the candidate set, otherwise discard it
  - Send the (k+1)-candidate set as a message to the main node for combining
  - Call Gen_cand_set()
We have implemented the first part of the algorithm, that is, generating the 1-item frequent set. With this, the main frame of the algorithm is coded. The second part will require making the already-coded processes run recursively.
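The implemented first phase (the parallel 1-item count) can be sketched on a single node as follows. This is our own simplified illustration, not the project code: the main process splits the transactions into chunks, spawns one worker per chunk, merges the partial counts it receives back, and prunes by a minimum support count:

```erlang
-module(par_count).
-export([frequent_ones/3, worker/1]).

%% Main process: split transactions, spawn workers, collect, merge, prune.
frequent_ones(Transactions, N, MinCount) ->
    Chunks = chunk(Transactions, N),
    Main = self(),
    [spawn(par_count, worker, [Main]) ! {count, C} || C <- Chunks],
    Merged = collect(length(Chunks), #{}),
    [Item || {Item, Cnt} <- maps:to_list(Merged), Cnt >= MinCount].

%% Worker: count item occurrences in its chunk, send partial counts back.
worker(Main) ->
    receive
        {count, Chunk} ->
            Counts = lists:foldl(
                       fun(T, Acc) ->
                               lists:foldl(
                                 fun(I, A) ->
                                         maps:update_with(I, fun(C) -> C + 1 end, 1, A)
                                 end, Acc, T)
                       end, #{}, Chunk),
            Main ! {partial, Counts}
    end.

%% Receive K partial count maps and merge them by adding counts.
collect(0, Acc) -> Acc;
collect(K, Acc) ->
    receive
        {partial, Counts} ->
            Merged = maps:fold(
                       fun(I, C, A) ->
                               maps:update_with(I, fun(V) -> V + C end, C, A)
                       end, Acc, Counts),
            collect(K - 1, Merged)
    end.

%% Split a list into chunks of roughly equal size.
chunk([], _) -> [];
chunk(L, N) ->
    Size = max(1, length(L) div N),
    {H, T} = lists:split(min(Size, length(L)), L),
    [H | chunk(T, N)].
</imports-placeholder>
```

Replacing the plain spawn/3 with spawn/4, giving a node name as the first argument, would distribute the same workers over the Erlang nodes described earlier; the message protocol is unchanged, which is what makes the distributed extension so short.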
Observation
We created varying numbers of nodes on one system, ran a particular data set (number of transactions = 100,000) in each case, and recorded the time taken to find the 1-item frequent set. The readings are shown below.

No. of Nodes    Execution Time (ms)
 1              1208851
 5               179889
10                87782
20                38480
30                22950
40                16503
50                19124
60                29217
70                30060

The table shows that the execution time decreases as the number of nodes increases. But beyond a point, the execution time starts increasing with the number of nodes. This happens because we are creating all the nodes on the same system, so node creation itself takes time. Also, more nodes means dividing the data into more parts and collecting results from more parts before proceeding, which also increases the overall execution time considerably.

Conclusion
Observing the data in the above table, we can say that distributed programming is a very powerful feature of Erlang that can significantly improve the efficiency of data mining algorithms operating on large data sets. Erlang is a very simple functional language with a small number of concepts, yet it can be used to code very complex algorithms in very few lines compared to other programming languages. Apart from data mining algorithms, it can also be used to write software that exploits its concurrency features to improve performance.
References
1. http://www.erlang.org
2. http://web.cse.iitk.ac.in/users/cs685/lectures.html
3. Zhi-gang Wang and Chi-She Wang. A Parallel Association Rule Mining Algorithm. In: Web Information Systems and Mining, Lecture Notes in Computer Science, Volume 7529, 2012, pp. 125-129.
4. Joe Armstrong, Robert Virding, Claes Wikstrom and Mike Williams. Concurrent Programming in Erlang.
5. Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms.