Distributed Apriori in Hadoop MapReduce Framework



By Shulei Zhao (sz2352) and Rongxin Du (rd2537)

Individual Contribution:
Shulei Zhao: Implemented the centralized Apriori algorithm and the input preprocessing code in Java.
Rongxin Du: Configured the distributed environment and implemented distributed Apriori using the Hadoop MapReduce APIs.

Cooperative Contribution: Debugging and code cleaning.

1. Introduction

This final report focuses on how we transform the centralized Apriori [1] algorithm into a distributed structure that can be applied in the MapReduce framework and produces the same correct results as the centralized one. The MapReduce framework is essentially a shared-nothing distributed architecture: each worker (slave) can only communicate with the master node, and communication among the workers is prohibited. In our implementation, we use a different algorithm, combining a loop with recursion, to implement the apriori_gen() and subset() functions of Apriori; this is discussed later.

Let's first briefly review these two important functions in the traditional Apriori. The apriori_gen() function takes the set of all (k-1)-large itemsets as its argument and returns a superset of the k-large itemsets as the candidate set. The outer for-loop keeps incrementing k to generate the candidate sets for all levels of large itemsets. Note that the (k-1)-itemset argument must be globally large, so we cannot derive a meaningful distributed form of this function (of course, one could replicate the (k-1)-large itemsets among all the sites, but that is trivial and gains nothing).

Suppose that in the k-th iteration of the outer for-loop, the candidate set of k-large itemsets is generated by apriori_gen(): Ck = apriori_gen(Lk-1). For this Ck, we then scan the whole transaction database (the inner for-loop), using each line (a transaction of items) to refine (prune) Ck into a truncated subset Ct, which must still be a superset of Lk. This process can be distributed in MapReduce as long as each site i can access the global candidate set Ck: each worker site i simply scans its local data partition Di and prunes Ck down to Ck(i), and the master node collects the Ck(i) from the workers to compute Ck = Ck(1) ∪ ... ∪ Ck(N), assuming there are N partitions on N worker sites. The occurrence counting of the itemsets in the refined candidate set Ct can be done on the fly, together with the filter condition count > min_sup, while the master node performs this reduce work.
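For reference, the sketch below shows the classic apriori_gen() join-and-prune step reviewed above. It is our own minimal Java rendering of the algorithm in [1], not the project code; itemsets are assumed to be sorted lists of integers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the classic apriori_gen() step from [1].
// Each itemset is a sorted List<Integer>; names are illustrative.
public class AprioriGen {

    // Join pairs of (k-1)-itemsets that share all but their last item,
    // then prune candidates that have an infrequent (k-1)-subset.
    static List<List<Integer>> aprioriGen(List<List<Integer>> largeKMinus1) {
        Set<List<Integer>> prev = new HashSet<>(largeKMinus1);
        List<List<Integer>> candidates = new ArrayList<>();
        for (List<Integer> a : largeKMinus1) {
            for (List<Integer> b : largeKMinus1) {
                int m = a.size();
                // join condition: equal prefixes, a's last item < b's last item
                if (a.subList(0, m - 1).equals(b.subList(0, m - 1))
                        && a.get(m - 1) < b.get(m - 1)) {
                    List<Integer> cand = new ArrayList<>(a);
                    cand.add(b.get(m - 1));
                    if (allSubsetsLarge(cand, prev)) {
                        candidates.add(cand);
                    }
                }
            }
        }
        return candidates;
    }

    // Prune step: every (k-1)-subset of a candidate must itself be large.
    static boolean allSubsetsLarge(List<Integer> cand, Set<List<Integer>> prev) {
        for (int drop = 0; drop < cand.size(); drop++) {
            List<Integer> sub = new ArrayList<>(cand);
            sub.remove(drop); // remove the item at index `drop`
            if (!prev.contains(sub)) return false;
        }
        return true;
    }
}
```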

2. Distributed Apriori in MapReduce

The following diagram illustrates the components and data flow of our distributed Apriori in the MapReduce framework. The green downward arrow represents the source data we process, coming from a centralized transaction database. All input and output files reside in HDFS. We take three map-reduce steps to accomplish Apriori. The first MapReduce job maps the input dataset into N partitions (where N equals the number of slave machines or daemons) and meanwhile emits a key-value pair <item, 1> for each item in a line. The reduce phase then takes the intermediate key-value pairs emitted in the map phase and sends them to the master node to collect the count per item.

Figure 1. Distributed Apriori in MapReduce

The first job component in Figure 1 generates the frequent 1-itemsets in the form of <item, count> key-value pairs (where count is the number of occurrences), which are also added to the global HashMap "item list". O1 is the output file of this frequent-items job.

The second job component generates the candidate sets. This candidate-generation job also takes the source data file as input. In the map phase, it first prunes the items that occur less often than the support threshold by looking them up in the global HashMap "item list", and then recursively calls the function CandidatesGenRecursion() to generate candidates. The reduce phase groups identical candidate itemsets and collects their counts as <candidate-itemset, count> key-value pairs, which are written to the output and also added to the global HashMap "comb".

The last job component, called Association Rule, takes the output of the candidate-generation job as its input file and mines association rules from the candidate itemsets. Note that we prune the left-hand side of an association rule by looking up its count in the HashMap "comb" and calculating the confidence:

    {X => XY}.conf = XY.count / X.count
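To make the first job component concrete, here is a minimal, hypothetical sketch of its mapper and reducer using the Hadoop MapReduce API. The class names, the comma-separated input format, and the MIN_SUP threshold are our assumptions, not the project's actual code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the frequent 1-itemsets job: emit <item, 1> per item,
// then sum the counts and keep only items above the support threshold.
public class FrequentItems {
    static final int MIN_SUP = 100; // illustrative threshold, not the project's value

    public static class ItemMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Each input line is one transaction: item1,item2,...
            for (String token : line.toString().split(",")) {
                item.set(token.trim());
                context.write(item, ONE); // emit <item, 1>
            }
        }
    }

    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text item, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            if (sum > MIN_SUP) { // keep only the frequent items
                context.write(item, new IntWritable(sum)); // <item, count>
            }
        }
    }
}
```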

3. New Way for Candidate Generation

We design the following candidate generation algorithm in a loop + recursion fashion.

INPUT: the list of frequent 1-large itemsets
OUTPUT: all possible candidate itemsets (a superset of the large itemsets)

In the i-th iteration of the for-loop, generate the different choices of item pairs (two items as a pair):

    [i, i+1], [i, i+2], [i, i+3], ..., [i, INPUT.length]

In the j-th level of the recursion, generate candidates of different sizes for a given choice:

    [i, i+1] -> [i, i+1, i+2] -> [i, i+1, i+2, i+3] -> ...
    [i, i+2] -> [i, i+2, i+3] -> [i, i+2, i+3, i+4] -> ...
    ...
    [i, INPUT.length-1] -> [i, INPUT.length-1, INPUT.length]
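Below is one possible Java rendering of this loop + recursion scheme. It is our reading of the description above, not the project's actual CandidatesGenRecursion(); it enumerates every itemset of size at least two over the pruned frequent-item list, preserving item order.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the loop + recursion candidate generation described above.
// `items` is assumed to be the list of frequent items after pruning.
public class CandidateGen {

    // The for-loop level: each frequent item starts a family of candidates.
    static List<List<String>> generate(List<String> items) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < items.size() - 1; i++) {
            List<String> prefix = new ArrayList<>();
            prefix.add(items.get(i));
            extend(items, prefix, i + 1, out); // the recursion level
        }
        return out;
    }

    // The recursion level: try each later item as the next member,
    // emit the grown candidate, then recurse to grow it further.
    static void extend(List<String> items, List<String> prefix,
                       int start, List<List<String>> out) {
        for (int j = start; j < items.size(); j++) {
            List<String> cand = new ArrayList<>(prefix);
            cand.add(items.get(j));
            out.add(cand);                   // candidate of size prefix+1
            extend(items, cand, j + 1, out); // deeper recursion: larger candidates
        }
    }
}
```

For example, generate(["1", "2", "3"]) produces {1,2}, {1,2,3}, {1,3}, {2,3}, which matches the pair-then-grow pattern sketched above.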

4. Association Rule Mining

We use an interesting trick here to mine the association rules: we reuse the loop + recursion idea, basically the candidate generation algorithm illustrated above, except that we substitute the INPUT with a k-large itemset. The output is then the set of possible left-hand sides of the association rules. For example, if the 3-large itemset is {1,2,3}, we use the loop + recursion algorithm to generate {1,2}, {1,3}, {2,3}, {1,2,3} as left-hand sides, and the trivial rule {1,2,3} => {1,2,3} is discarded. Accordingly, we may mine the following association rules (they still need further pruning; the pruning techniques are discussed below):

    {1,2} => {1,2,3}  (equivalent to {1,2} => {3})
    {1,3} => {1,2,3}  (equivalent to {1,3} => {2})
    {2,3} => {1,2,3}  (equivalent to {2,3} => {1})

5. Pruning Techniques

We use the following four pruning techniques to reduce the size of the candidate sets and to ensure that the output of the second job (candidate generation in Figure 1) consists of large itemsets only.

(1) In the map phase of the candidate-generation job, we only take the items with count > support as the INPUT of the loop + recursion algorithm described in the Candidate Generation section above. For this we keep a global HashMap "item list" to look up the item counts.

(2) In the reduce phase of the candidate-generation job, we only output the candidate itemsets with count > support. These output candidate itemsets are exactly the globally large itemsets. In addition, to keep track of the large itemsets, we keep a global HashMap "comb" that saves them for future lookup.

(3) In the map phase of the association rule job, since we take the output of the previous job as the right-hand sides, the right-hand sides are always large itemsets. For every left-hand side generated by running the loop + recursion algorithm on a right-hand side, we have to check whether it is globally large. To do this, we simply look up the HashMap "comb" saved by the previous job and retrieve the count.

(4) Association rules are pruned while calculating the confidence; this step is trivial.

6. Experiment and Results

Our data comes from [2] and consists of 1000 lines of market-basket transactions. There are basically four modes for Hadoop MapReduce: (1) stand-alone mode (the same machine acts as both master and slave); (2) pseudo-distributed mode (each Hadoop daemon runs in a separate Java process); (3) cluster mode (requires a cluster of real machines); (4) Amazon Elastic MapReduce. We test our code under modes 1 and 2, and it works, giving the correct results. Here are the first three records of the result when support = 10% and confidence = 80%. The first association rule, for example, can be read as: customers who buy artichok and ham also tend to buy heineken; the left-hand side occurs 115 times, the right-hand side occurs 111 times, and the confidence is 96.52%.

    {artichok ham} => [artichok, heineken, ham]                   {X}:115 => {X,Y}:111  Confidence(percentage) = 96.521736
    {artichok ham} => [avocado, artichok, heineken, ham]          {X}:115 => {X,Y}:111  Confidence(percentage) = 96.521736
    {avocado artichok ham} => [avocado, artichok, heineken, ham]  {X}:111 => {X,Y}:111  Confidence(percentage) = 100.0

Another interesting issue is that our distributed Apriori algorithm is sensitive to the order of items within a single transaction, and also to duplicates. For example, if a transaction is "apple, orange, apple", our algorithm treats {apple, orange} and {orange, apple} as different candidate itemsets, and {apple, apple} could also become a candidate, which makes no sense. Such input is nevertheless very common: think of checking out at a supermarket cashier, where you might of course buy two apples together. To solve this problem, we wrote a sorting program that arranges each transaction (a line of the CSV) in increasing numeric or alphabetic order and eliminates all duplicates, as sketched below.
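A minimal sketch of such a preprocessing step, assuming a comma-separated input file; the file names are placeholders and the real project code may differ (it uses Java 8 conveniences for brevity):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Preprocessing sketch: sort the items of each CSV transaction and
// drop duplicate items, so {apple, orange, apple} becomes apple,orange.
public class SortTransactions {
    public static void main(String[] args) throws IOException {
        List<String> cleaned = Files.readAllLines(Paths.get("transactions.csv"))
                .stream()
                .map(line -> {
                    // TreeSet sorts alphabetically and removes duplicates in one pass
                    TreeSet<String> items = new TreeSet<>();
                    for (String item : line.split(",")) items.add(item.trim());
                    return String.join(",", items);
                })
                .collect(Collectors.toList());
        Files.write(Paths.get("transactions-sorted.csv"), cleaned);
    }
}
```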

7. Complexity Analysis

The Apriori algorithm faces the following difficulty [3]: it spends much of its time dealing with a particularly large number of candidate sets, since the candidate (k+1)-itemsets are constructed through the self-join of the frequent k-itemsets, and their growth is exponential. Here we do not use the self-join; instead, we use a recursion enclosed by a for-loop. The time complexity is about the same, O((INPUT.length)^2), but the subtle difference is that we trade space for time. As is well known, recursive function calls occupy many run-time function frames on the stack; however, if we map the data into N partitions, the per-worker memory overhead is 1/N of that of a single partition. In conclusion, we achieve the same goal of generating candidates without using the self-join operation.

8. Problems and Future Work

We got stuck running on Amazon Elastic MapReduce. The problem is that the global variables are not visible on the cloud to the other virtual machines; Hadoop will not distribute these global variables to be shared by every partition, due to its inherently shared-nothing architecture. We tried many methods to make it work, including writing the global variables as files to the Amazon S3 (Simple Storage Service) file system using paths like s3n://folder/file. However, even though we can write the files on the cloud, they do not seem to be visible to Hadoop HDFS, and we do not know how to reformat files on the cloud for HDFS. Testing MapReduce code in stand-alone and pseudo-distributed mode is trivial, and it is even slower than running the centralized algorithm because of the partitioning and collection overheads. We are therefore eager to solve the above problem and run our code on the cloud to process larger data (100 MB or more), and also to test the speed-up and scalability of our distributed Apriori by launching multiple slave machines in EMR [5]. We may also try to use the DistributedCache [6] to cache files and pass them to a mapper by overriding the setup() function.

References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[2] AssociationsSP.txt, High-Performance Information Computing Center, Computer Information Systems Department of California State University, Los Angeles, April 2012.
[3] X. Y. Yang, Z. Liu and Y. Fu, "MapReduce as a Programming Model for Association Rules Algorithm on Hadoop," Proceedings of the 3rd International Conference on Information Sciences and Interaction Sciences (ICIS), pp. 99-102, 23-25 June 2010.
[4] http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
[5] http://aws.amazon.com/elasticmapreduce/
[6] http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/filecache/DistributedCache.html
[7] http://cscarioni.blogspot.com/2010/11/hadoop-basics.html