Apriori-Map/Reduce Algorithm



Jongwook Woo
Computer Information Systems Department
California State University, Los Angeles, CA

Abstract

The Map/Reduce algorithm has received attention since cloud computing services with Hadoop frameworks became available, and many sequential algorithms have been converted to corresponding Map/Reduce algorithms. This paper presents a Map/Reduce version of the legacy Apriori algorithm, which has been popular for collecting the item sets that occur frequently when composing Association Rules in Data Mining. Theoretically, it shows that the proposed algorithm provides high-performance computing depending on the number of Map and Reduce nodes.

Keywords: Map/Reduce, Apriori algorithm, Data Mining, Association Rule, Hadoop, Cloud Computing

1 Introduction

People have started studying and implementing the Map/Reduce algorithm for many applications, especially for computing Big Data, that is, data sets larger than peta-bytes, as cloud computing services have become available, for example from Amazon AWS. Big Data is generated by business applications such as smart phone and social networking services. Greater computing power is especially needed in Data Mining, which analyzes tera- or peta-bytes of data. Thus, this paper presents the Apriori-Map/Reduce algorithm, which implements and executes the Apriori algorithm on a Map/Reduce framework.

In this paper, Section 2 covers related work. Section 3 describes the legacy Apriori algorithm. Section 4 introduces Map/Reduce and Hadoop and presents the proposed Apriori-Map/Reduce algorithm. Section 5 concludes.

2 Related Work

Association Rule (or Affinity) Analysis is a fundamental data mining analysis for finding co-occurrence relationships, such as the purchase behavior of customers. The analysis is a classic sequential computation, and most data mining books illustrate it. Aster Data offers a SQL MapReduce framework as a product [6].
Aster provides nPath SQL to process big data stored in the database. Market Basket Analysis can be executed on the framework, but it is based on Aster's SQL API with a MapReduce database. Woo et al. [11-13] present Market Basket Analysis algorithms with Map/Reduce, proposing an algorithm based on (key, value) pairs and executing the code on a Map/Reduce platform. However, that work does not use the Apriori property; it instead adopts a join function to produce paired items, which may compute unnecessary data.

3 Apriori Algorithm

The Apriori algorithm, shown in Figure 3.1, has been used to generate the frequent item sets in a body of transaction data in order to produce an association rule.

    Map transaction t in data source to all Map nodes;          // (1)
    C1 = {size 1 frequent items};                               // (2)
    min_support = num / total items; for example: 33%
    L1 = {size 1 frequent items that meet min_support};
    for (k = 1; Lk != {}; k++) do begin                         // (3)
        // sort to remove duplicated items
        Ck+1 = Lk join_sort Lk;
        for each transaction t in data source with Ck+1 do      // (4)
            increment the count of all candidates in Ck+1
            that are contained in t;                            // (5)
        // find Lk+1 with Ck+1 and min_support
        Lk+1 = {size k+1 frequent items that meet min_support};
    end
    return the union of all Lk;

    Figure 3.1. Apriori Algorithm

The association rule has been used efficiently to manage stock items, products, etc., by analyzing customers' behavior. The algorithm is based on the Apriori property: all subsets of a frequent item set must also be frequent. For example, suppose the minimum support is 2 and two size-2 item sets are generated, <[coffee, cracker], 3> and <[coke, cracker], 1>, written as <[item pair], frequency>. When the size-3 item sets <[coffee, cracker, milk]> and <[coke, cracker, milk]> are produced, by the Apriori property <[coke, cracker, milk]> is eliminated before counting the frequencies of the item sets in the transaction data, which reduces unnecessary computing time.

The time complexity of the algorithm is O(k x (k^2 + t x n)), where k: size of frequent items, t: number of transactions, n: number of item elements in each transaction t. It simplifies to O(k^3 + k x t x n) and then to O(k x t x n) where t >> k and n >> k.

4 Apriori-Map/Reduce Algorithm

4.1 Map/Reduce in Hadoop

Map/Reduce is an algorithm rooted in functional programming and long used in Artificial Intelligence. It has received attention since being re-introduced by Google to solve the problem of analyzing Big Data, defined as more than peta-bytes of data, in a distributed computing environment. It is composed of two functions to specify, Map and Reduce, both defined to process data structured in (key, value) pairs.

Inspired by Google's MapReduce and GFS (Google File System) [1], the Map/Reduce platform is implemented in the Apache Hadoop project, which develops open-source software for reliable, scalable, distributed computing. A Hadoop cluster can be composed of hundreds of nodes that process and compute Big Data. Hadoop has been used by a global community of contributors such as Yahoo, Facebook, Cloudera, and Twitter. Hadoop provides many subprojects, including Hadoop Common, HDFS, MapReduce, Avro, Chukwa, HBase, Hive, Mahout, Pig, and ZooKeeper [2].

[Figure 4.1. Map/Reduce flows: m Map functions each emit (key, value) pairs; the framework aggregates/combines the pairs into (key, <value list>) per key; and the Reduce functions emit (key, final value) pairs.]

The map and reduce functions run on distributed nodes in parallel. Each map and reduce operation can be processed independently on each node, and all the operations can be performed in parallel. Map/Reduce can handle Big Data sets because the data are distributed on HDFS (Hadoop Distributed File System) and the operations move close to the data for better performance [5]. Hadoop is a restricted or partial parallel programming platform because it needs to collect data as (key, value) pairs as input and, computing in parallel, generates a list of (key, value) pairs as output. In the map function, the master node divides the input into smaller sub-problems and distributes those to worker nodes.

    Map transaction t in data source to all Map nodes;          // (1)
    In each Map node m:
        Cm1 = {size 1 frequent items at the node m};            // (2)
    In Reduce, compute C1 and L1 with all Cm1:                  // (3)
        C1 = {size 1 frequent items};
        min_support = num / total items; for example: 33%
        L1 = {size 1 frequent items that meet min_support};
    for (k = 1; Lk != {}; k++) do begin                         // (4)
        In each Map node m:
            // Lmk: Lk mapped to each node m
            // sort to remove duplicated items
            Cm(k+1) = Lk join_sort Lmk;                         // (5)
        In Reduce, using the Apriori property:
            compute Ck+1 with all sorted Cm(k+1);
            if (k >= 3) prune(Ck+1);
        for each transaction t in data source with Ck+1 do      // (6)
            In each Map node m:
                increment the count of all candidates in Lm(k+1)
                that are contained in t;                        // (7)
        In Reduce, find Lk+1 with Lm(k+1) and min_support:
            Lk+1 = {size k+1 frequent items that meet min_support};
    end
    return the union of all Lk;

    Figure 4.2. Apriori-Map/Reduce Algorithm

Figure 4.1 illustrates the Map/Reduce control flow where, with m the map node id, each value_mn is simply 1 and is accumulated for each co-occurrence of items, as illustrated in the Market Basket Analysis algorithm of Woo et al. [11-13]. The Map function takes inputs (k1, v1) and generates <k2, v2>, where < > represents a list or set.
The Combiner function, which resides on the map node, takes inputs (k2, <v2>) and generates <k2, v2>. The Reduce function takes inputs (k2, <v2>) and generates <k3, v3>.
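As a concrete illustration, the (key, value) flow of Figure 4.1 can be simulated in a single process (a minimal sketch, not Hadoop code; the function names are illustrative): each map task emits (item, 1) pairs, the pairs are grouped by key as the framework's shuffle does, and each reduce task sums the values.

```python
from collections import defaultdict

def map_task(transaction):
    # Emit a (key, value) pair of (item, 1) for every item seen.
    return [(item, 1) for item in transaction]

def shuffle(pairs):
    # Group values by key, as the framework does between Map and Reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    # Sum the occurrence counts for one key.
    return (key, sum(values))

transactions = [["cracker", "beer"], ["chicken", "pizza"]]
pairs = [p for t in transactions for p in map_task(t)]
counts = dict(reduce_task(k, v) for k, v in shuffle(pairs).items())
print(counts)  # each item occurs once across these two transactions
```

A Combiner would apply the same summation as `reduce_task`, but locally on each map node, to shrink the data shipped to the reducers.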

The Map/Reduce approach, especially for big data sets, gives us the opportunity to develop new systems and evolve IT toward parallel computing environments. The approach started only a few years ago, but many IT companies around the world have already adopted it.

4.2 Apriori-Map/Reduce Algorithm

Figure 4.2 is the proposed Apriori-Map/Reduce algorithm, which runs on a parallel Map/Reduce framework such as Apache Hadoop. The prune(Ck+1) function removes non-frequent item sets from Ck+1 by eliminating candidates that contain a non-frequent subset in Ck, since a non-frequent item set cannot be a subset of a frequent item set.

The algorithm starts with (1), which calculates the frequent item set on each map node with time complexity O(t/m x n), where t: number of transactions, n: number of items in the transactions, m: number of map nodes. Then (2) collects the frequent item sets and (3) removes items that do not meet the minimum support in the reduce nodes, with O(t/r x n), where r: number of reduce nodes. The time complexity of (1-3) simplifies initially to O((t/m + t/r) x n) and then to O(t x n / p) when m = r = p. (4) calculates the frequent item set with an additional item by joining, sorting, and eliminating the duplicated items in each map node, where join_sort is O(k x k / m), with k: size of frequent items, and prune is O((k-1) x k / r), which simplifies to O(k^2 / r). Similarly, (5) collects the frequent item sets at the reduce nodes. (6) counts the candidate item frequencies at the map nodes, with O(t/m x n). (7) removes items that do not meet the minimum support in the reduce nodes, with O(t/r x n); together (6-7) simplify initially to O((t/m + t/r) x n) and then to O(t x n / p) when m = r = p.
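The prune step described above can be sketched as follows (a hypothetical helper, assuming candidates and frequent sets are represented as Python frozensets): a size-(k+1) candidate is kept only if every size-k subset of it is frequent.

```python
from itertools import combinations

def prune(candidates_k1, frequent_k):
    # Keep a size-(k+1) candidate only if all of its size-k subsets
    # appear in L_k (the Apriori property).
    k = len(next(iter(frequent_k)))
    kept = set()
    for cand in candidates_k1:
        if all(frozenset(sub) in frequent_k for sub in combinations(cand, k)):
            kept.add(cand)
    return kept

# Toy data: {a, b, d} is pruned because its subset {a, d} is not in L_2.
L2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]}
C3 = {frozenset(("a", "b", "c")), frozenset(("a", "b", "d"))}
print(prune(C3, L2))  # only {a, b, c} survives
```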
4.2.1 Time Complexity

The overall time complexity of the Apriori-Map/Reduce algorithm is O(k x (k^2 + t x n) / p), where k: size of frequent items, t: number of transactions, n: number of items in the transactions, p: number of map and reduce nodes, assuming the node sizes are the same. It becomes O((k^3 + k x t x n) / p) and then O(k x t x n / p) where t >> k and n >> k. This theoretically shows that the time complexity is p times less than that of the sequential Apriori algorithm.

4.2.2 Example of the Algorithm

Figure 4.3 shows example transaction data at a store, used to explain how the proposed algorithm works.

    Transaction 1: cracker, beer
    Transaction 2: chicken, pizza
    Transaction 3: coke, cracker, beer
    Transaction 4: coke, cracker
    Transaction 5: beer, chicken, coke
    Transaction 6: chicken, coke

    Figure 4.3. Transaction data at a store

Suppose that the minimum support is 2/6, that is, two transactions out of six, or about 33%.

(a) The first step

Assuming there are three map nodes, two transactions are distributed to each map node: node 1 has transactions 1 and 2, node 2 has transactions 3 and 4, and node 3 has transactions 5 and 6. Therefore, we can generate the size-1 frequent items Cm1 as <item, frequency> pairs at each map node m:

C11 = {<cracker, 1>, <beer, 1>, <chicken, 1>, <pizza, 1>}
C21 = {<coke, 2>, <cracker, 2>, <beer, 1>}
C31 = {<beer, 1>, <chicken, 2>, <coke, 2>}

From all Cm1, the reduce nodes collect and compute C1, the size-1 item counts, and L1, the size-1 items that meet the minimum support. Thus, [pizza, 1] of C1 is eliminated from L1:

C1 = {[cracker, 3], [beer, 3], [chicken, 3], [pizza, 1], [coke, 4]}
L1 = {[cracker, 3], [beer, 3], [chicken, 3], [coke, 4]}

(b) The second step

In the loop, L1 is mapped to each map node m as Lm1.
L11 = {cracker, beer}
L21 = {chicken}
L31 = {coke}

Then the size-2 candidate item pair sets can be generated by joining and sorting L1 with the item sets of each map node m, Cm2 = L1 join_sort Lm1, where duplicated item sets are eliminated:

C12 = {<beer, cracker>, <chicken, cracker>, <beer, chicken>, <coke, cracker>, <beer, coke>}
C22 = {<chicken, cracker>, <beer, chicken>, <chicken, coke>}
C32 = {<coke, cracker>, <beer, coke>, <chicken, coke>}

From all Cm2, the reduce nodes collect C2, the size-2 candidate item pairs:

C2 = {<beer, cracker>, <chicken, cracker>, <beer, chicken>, <coke, cracker>, <beer, coke>, <chicken, coke>}

C2 is then mapped back to each map node as Cm2 as follows:

C12 = {<beer, cracker>, <chicken, cracker>}
C22 = {<beer, chicken>, <coke, cracker>}
C32 = {<beer, coke>, <chicken, coke>}
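The join-and-sort that yields C2 above amounts to forming every sorted pair of distinct frequent items from L1 (a simplified sequential sketch; the paper distributes the join across map nodes):

```python
from itertools import combinations

L1 = ["cracker", "beer", "chicken", "coke"]

# Joining L_1 with itself, sorting, and removing duplicates yields
# every sorted pair of distinct frequent items, i.e. the candidate set C_2.
C2 = sorted(combinations(sorted(L1), 2))
for pair in C2:
    print(pair)  # six candidate pairs in all
```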

Now we can generate the size-2 frequent items as [item pair, frequency] sets at each node m, which counts its candidates against all transaction data, as presented in Figure 4.2:

C12 = {[<beer, cracker>, 2], [<chicken, cracker>, 0]}
C22 = {[<beer, chicken>, 1], [<coke, cracker>, 2]}
C32 = {[<beer, coke>, 2], [<chicken, coke>, 2]}

From all Cm2, the reduce nodes collect and compute C2, the counted size-2 item pairs, and L2, the size-2 item pairs that meet the minimum support. Thus, [<chicken, cracker>, 0] and [<beer, chicken>, 1] are eliminated from C2 for L2:

C2 = {[<beer, cracker>, 2], [<chicken, cracker>, 0], [<beer, chicken>, 1], [<coke, cracker>, 2], [<beer, coke>, 2], [<chicken, coke>, 2]}
L2 = {[<beer, cracker>, 2], [<coke, cracker>, 2], [<beer, coke>, 2], [<chicken, coke>, 2]}

(c) The third step

In the loop, L2 is mapped to each map node m:

L12 = {<beer, cracker>, <coke, cracker>}
L22 = {<beer, coke>}
L32 = {<chicken, coke>}

Then the size-3 candidate item sets can be generated by joining and sorting L2 with the item sets of each map node m, Cm3 = L2 join_sort Lm2, where duplicated item sets are eliminated:

C13 = {<beer, coke, cracker>, <beer, chicken, coke>, <chicken, coke, cracker>}
C23 = {<beer, coke, cracker>, <beer, chicken, coke>}
C33 = {<beer, chicken, coke>, <chicken, coke, cracker>}

Since the item size k is 3 or greater, prune(Cmk) deletes the non-frequent item sets that violate the Apriori property that all subsets of a frequent item set must also be frequent:

C13 = {<beer, coke, cracker>}
C23 = {<beer, coke, cracker>}
C33 = {}

C13, C23, and C33 eliminate <beer, chicken, coke>, and C13 and C33 remove <chicken, coke, cracker>, because <beer, chicken> and <chicken, cracker>, respectively, are not members of L2, that is, they are non-frequent item sets.

From all Cm3, the reduce nodes collect C3, the size-3 candidate item sets, and L3, the size-3 item sets that meet the minimum support:

C3 = {<beer, coke, cracker>}
L3 = {}

Since L3 does not have any elements, the algorithm ends.
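The whole walkthrough can be checked with a compact sequential sketch of the algorithm (a single-process illustration, not the distributed implementation; the helper name is ours):

```python
from itertools import combinations

transactions = [{"cracker", "beer"}, {"chicken", "pizza"},
                {"coke", "cracker", "beer"}, {"coke", "cracker"},
                {"beer", "chicken", "coke"}, {"chicken", "coke"}]
MIN_SUPPORT = 2  # two transactions out of six, i.e. 2/6

def frequent_itemsets(transactions, min_support):
    # Count size-k candidates against the transactions, keep those
    # meeting min_support, and grow k until no candidate survives.
    items = sorted({i for t in transactions for i in t})
    result, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        if not Lk:
            break
        result[k] = Lk
        # Candidate generation with the Apriori property: keep only
        # (k+1)-item unions whose size-k subsets are all frequent.
        frequent = set(Lk)
        candidates = [c for c in {a | b for a in frequent for b in frequent}
                      if len(c) == k + 1
                      and all(frozenset(s) in frequent
                              for s in combinations(c, k))]
        k += 1
    return result

L = frequent_itemsets(transactions, MIN_SUPPORT)
print(L[1])  # size-1 frequent items with their counts
print(L[2])  # size-2 frequent item pairs with their counts
```

Running it reproduces the walkthrough: the loop stops after size 2, since the only size-3 candidate, {beer, coke, cracker}, occurs in just one transaction.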
Therefore, we have the frequent item sets L1 and L2, with sizes 1 and 2 respectively:

L1 = {[cracker, 3], [beer, 3], [chicken, 3], [coke, 4]}
L2 = {[<beer, cracker>, 2], [<coke, cracker>, 2], [<beer, coke>, 2], [<chicken, coke>, 2]}

The item sets L1 and L2 can be used to produce the association rules of the transactions.

5 Conclusion

This paper proposes the Apriori-Map/Reduce algorithm and analyzes its time complexity, which theoretically shows that the algorithm gains much higher performance than the sequential algorithm as map and reduce nodes are added. The item sets produced by the algorithm can be used to compute and produce Association Rules for market analysis. Future work is to build the code following the algorithm on the Hadoop framework and to generate experimental data by executing the code on sample transaction data, which would practically prove that the proposed algorithm works. Besides, the algorithm should be extended to produce association rules.

6 References

[1] "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, Google Labs, pp. 137-150, OSDI 2004
[2] Apache Hadoop Project, http://hadoop.apache.org/
[3] "Building a business on an open source distributed computing", Bradford Stephens, O'Reilly Open Source Convention (OSCON) 2009, July 20-24, 2009, San Jose, CA
[4] "MapReduce Debates and Schema-Free", Woohyun Kim, Coord, March 3, 2010
[5] "Data-Intensive Text Processing with MapReduce", Jimmy Lin and Chris Dyer, Tutorial at the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), June 2010, Los Angeles, California
[6] SQL MapReduce framework, Aster Data, http://www.asterdata.com/product/advanced-analytics.php
[7] Apache HBase, http://hbase.apache.org/
[8] "Data-Intensive Text Processing with MapReduce", Jimmy Lin and Chris Dyer, Morgan & Claypool Publishers, 2010

[9] GNU Coord, http://www.coordguru.com/
[10] "The Technical Demand of Cloud Computing", Jongwook Woo, Korean Technical Report of KISTI (Korea Institute of Science and Technical Information), Feb 2011
[11] "Market Basket Analysis Example in Hadoop", Jongwook Woo, http://dal-cloudcomputing.blogspot.com/2011/03/marketbasket-analysis-example-in.html, March 2011
[12] Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, "Market Basket Analysis Algorithm with NoSQL DB HBase and Hadoop", The Third International Conference on Emerging Databases (EDB 2011), Korea, Aug. 25-27, 2011
[13] Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas, July 18-21, 2011