CLUSTBIGFIM-FREQUENT ITEMSET MINING OF BIG DATA USING PRE-PROCESSING BASED ON MAPREDUCE FRAMEWORK

Sheela Gole 1 and Bharat Tidke 2
1 Department of Computer Engineering, Flora Institute of Technology, Pune, India

ABSTRACT

Nowadays an enormous amount of data is being explored through the Internet of Things (IoT) as technologies advance and people use these technologies in day-to-day activities. This data is termed Big Data, with its own characteristics and challenges. Frequent Itemset Mining algorithms aim to disclose frequent itemsets from transactional databases, but as the dataset size increases, the data cannot be handled by traditional frequent itemset mining. The MapReduce programming model solves the problem of large datasets, but it has a large communication cost which reduces execution efficiency. This paper proposes a new k-means based pre-processing technique applied to the BigFIM algorithm. ClustBigFIM uses a hybrid approach: clustering with the k-means algorithm to generate clusters from huge datasets, and Apriori and Eclat to mine frequent itemsets from the generated clusters using the MapReduce programming model. Results show that the execution efficiency of the ClustBigFIM algorithm is increased by applying the k-means clustering algorithm before the BigFIM algorithm as a pre-processing technique.

KEYWORDS

Association Rule Mining, Big Data, Clustering, Frequent Itemset Mining, MapReduce.

1. INTRODUCTION

Data mining and KDD (Knowledge Discovery in Databases) are essential techniques to discover hidden information from large datasets with various characteristics. Nowadays Big Data has bloomed in various areas such as social networking, retail, web blogs, forums and online groups [1]. Frequent Itemset Mining (FIM) is one of the important techniques of Association Rule Mining (ARM). The goal of FIM techniques is to reveal frequent itemsets from transactional databases. Agrawal et al. [2] put forward the Apriori algorithm, which generates frequent itemsets having a frequency greater than the given minimum support. It is not efficient on a single computer when the dataset size increases. An enormous amount of work has been put forward to uncover frequent items.
There exist various parallel and distributed algorithms which work on large datasets, but they have memory and I/O cost limitations and cannot handle Big Data [3] [4]. MapReduce, developed by Google [5], together with the Hadoop Distributed File System (HDFS) is exploited to find frequent itemsets from Big Data on large clusters. MapReduce uses a parallel computing approach and HDFS is a fault-tolerant system. MapReduce has Map and Reduce functions; the data flow in MapReduce is shown in the figure below.

DOI: /ijfct
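As a rough illustration of the Map and Reduce functions mentioned above, the following Python sketch simulates the data flow in a single process by counting the support of individual items. It is not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `item_supports`, and the toy transactions, are assumptions made for this example only.

```python
from collections import defaultdict

def map_phase(transaction):
    # Map: emit one (item, 1) pair per distinct item of a transaction.
    return [(item, 1) for item in set(transaction)]

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as a framework would
    # do between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: sum the partial counts collected for one item.
    return key, sum(values)

def item_supports(transactions):
    # Run map over every transaction, then shuffle and reduce.
    pairs = [pair for t in transactions for pair in map_phase(t)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical toy input: four transactions over the items a, b, c.
transactions = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
supports = item_supports(transactions)
print(supports)
```

On a real cluster the map calls would run on different nodes and the shuffle would move data across the network, which is where the communication cost discussed in the abstract comes from.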

Figure 1. Map-Reduce data flow.

In this paper, based on the BigFIM algorithm, a new algorithm optimizing the speed of BigFIM is proposed. Firstly, clusters are generated from Big Datasets using parallel k-means clustering. Then the clusters are mined using the ClustBigFIM algorithm, effectively increasing the execution efficiency. This paper is organized as follows. Section 2 gives an overview of related work on frequent itemset mining. Section 3 gives an overview of the background theory for ClustBigFIM. Section 4 explains the pseudo code of ClustBigFIM. The experimental results with comparative analysis are given in Section 5. Section 6 concludes the paper.

2. RELATED WORK

Various sequential and parallel frequent itemset mining algorithms are available [5] [6] [7] [8] [9] [10], but there is a need for FIM algorithms which can handle Big Data. This section gives an insight into frequent itemset mining approaches which exploit the MapReduce framework. The existing algorithms face challenges while dealing with Big Data.

A parallel implementation of the traditional Apriori algorithm based on the MapReduce framework was put forward by Lin et al. [11], and Li et al. [12] also proposed a parallel implementation of the Apriori algorithm. Hammoud [13] put forward the MRApriori algorithm, which is based on the MapReduce programming model and the classic Apriori algorithm. It uses iterative horizontal and vertical switching and therefore does not require repetitive scans of the database. A parallel implementation of the FP-Growth algorithm has been put forward in [14].

Liu et al. [15] put forward the IOMRA algorithm, a modified FAMR algorithm which optimizes execution efficiency by pre-processing with Apriori TID, removing all low-frequency 1-item itemsets from the given database. Then the longest possible candidate itemset size is determined using the length of each transaction and the minimum support.

Moens et al. [16] put forward two algorithms, Dist-Eclat and BigFIM. Dist-Eclat is a distributed version of the Eclat algorithm which mines prefix trees and extracts frequent itemsets faster, but it is not scalable enough. BigFIM applies the Apriori algorithm before Dist-Eclat to handle frequent itemsets up to size k, and the next (k+1)-itemsets are extracted using the Eclat algorithm, but the BigFIM algorithm has limitations on speed. Both algorithms are based on the MapReduce framework. Moens has also recently proposed implementations of the Dist-Eclat and BigFIM algorithms using Mahout.

Approximate frequent itemsets are mined using the PARMA algorithm, which was put forward by Riondato et al. [17]. The k-means clustering algorithm is used for finding clusters, which are called sample lists. Frequent itemsets are extracted very fast, reducing execution time.

Malek and Kadima [18] put forward parallel k-means clustering, which uses the MapReduce programming model to generate clusters in parallel, increasing the performance of the traditional k-means algorithm. It has Map, Combine and Reduce functions which use (key, value) pairs. The distances between sample points and random centres are calculated for all points in the map function. Intermediate output values from the map function are combined in the combiner function. All samples are assigned to the closest cluster in the reduce function.

3. BACKGROUND

3.1. Problem Statement

Let $I = \{i_1, i_2, i_3, \ldots, i_n\}$ be a set of items. A set of items $X = \{i_1, i_2, i_3, \ldots, i_k\} \subseteq I$ is called a k-itemset. A transaction $T = \{t_1, t_2, t_3, \ldots, t_m\}$ is denoted as $T = (tid, I)$, where $tid$ is the transaction ID, and $T \in D$, where $D$ is a transactional database. The cover of itemset $X$ in $D$ is the set of IDs of the transactions that contain all items of $X$:

$Cover(X, D) = \{tid \mid (tid, I) \in D, X \subseteq I\}$

The support of an itemset $X$ in $D$ is the count of transactions containing the items of $X$:

$Support(X, D) = |Cover(X, D)|$

An itemset is called frequent when its support reaches the absolute minimum support threshold $\sigma_{abs}$, with $0 \le \sigma_{abs} \le |D|$. Partitioning of transactions into a set of groups is called clustering.
Let $S$ be the number of clusters; then $\{C_1, C_2, C_3, \ldots, C_S\}$ is a set of clusters from $\{t_1, t_2, t_3, \ldots, t_m\}$, where $m$ is the number of transactions. Each transaction is assigned to only one cluster, i.e. $C_p \neq \emptyset$ and $C_p \cap C_q = \emptyset$ for $1 \le p, q \le S$, $p \neq q$; each $C_p$ is called a cluster. Let $\mu_z$ be the mean of cluster $C_z$. The squared error between the mean of the cluster and the transactions in the cluster is given by

$J(C_z) = \sum_{t_i \in C_z} \| t_i - \mu_z \|^2$

k-means is used for minimizing the sum of squared errors over all $S$ clusters, which is given by

$J(C) = \sum_{z=1}^{S} \sum_{t_i \in C_z} \| t_i - \mu_z \|^2$

The k-means algorithm starts with one cluster and assigns each transaction to the cluster with minimum squared error.
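The definitions of cover, support and the clustering objective can be checked on a toy example. The Python sketch below is only illustrative: the helper names, the dictionary layout of the database (tid mapped to a set of items) and the encoding of transactions as numeric vectors for the squared error are assumptions of this example, not details given in the paper.

```python
def cover(itemset, database):
    # Cover(X, D): tids of the transactions that contain every item of X.
    wanted = set(itemset)
    return {tid for tid, items in database.items() if wanted <= items}

def support(itemset, database):
    # Support(X, D) = |Cover(X, D)|.
    return len(cover(itemset, database))

def squared_error(cluster):
    # J(C_z): sum of squared distances between each vector and the mean.
    dim = len(cluster[0])
    mean = [sum(t[d] for t in cluster) / len(cluster) for d in range(dim)]
    return sum((t[d] - mean[d]) ** 2 for t in cluster for d in range(dim))

def total_squared_error(clusters):
    # J(C): sum of J(C_z) over all non-empty clusters.
    return sum(squared_error(c) for c in clusters if c)

# Hypothetical database: tid -> set of items.
database = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}, 4: {"b"}}
# Hypothetical clusters of transactions encoded as 2-dimensional vectors.
clusters = [[(1, 0), (1, 0)], [(0, 1), (0, 0)]]

ab_cover = cover({"a", "b"}, database)
ab_support = support({"a", "b"}, database)
objective = total_squared_error(clusters)
print(ab_cover, ab_support, objective)
```

With an absolute threshold of 2, the itemset {a, b} would count as frequent in this toy database because two transactions contain both items.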

3.2. Apriori Algorithm

Apriori is the first frequent itemset mining algorithm, put forward by Agrawal et al. [19]. A transactional database holds, for each transaction, a transaction identifier and the set of items present in the transaction. The Apriori algorithm scans the horizontal database and finds the frequent items of size 1 using the minimum support condition. From the frequent items discovered in iteration 1, candidate itemsets are formed and the frequent itemsets of size two are extracted using the minimum support condition. This process is repeated until either the list of candidate itemsets or the list of frequent itemsets is empty. It requires repetitive scans of the database. The monotonicity property is used for pruning candidates that cannot be frequent.

3.3. Eclat Algorithm

The Eclat algorithm, proposed by Zaki et al. [20], works on a vertical database. The TID list of each item is calculated, and the intersection of the TID lists of items is used for extracting frequent itemsets of size k+1. There is no need for iterative scans of the database, but it is expensive to manipulate large TID lists.

3.4. k-means Algorithm

The k-means algorithm [21] is a well-known clustering technique which takes the number of clusters as input. Random points are chosen as centres of gravity, and a distance measure is used to calculate the distance of each point from the centres of gravity. Each point is assigned to only one cluster, based on high intra-cluster similarity and low inter-cluster similarity.

4. CLUSTBIGFIM ALGORITHM

This section gives the high level architecture of the ClustBigFIM algorithm and the pseudo code of the phases used in it.

4.1. High Level Architecture

Figure 2. High Level Architecture of ClustBigFIM Algorithm

Clustering is applied on large datasets as a pre-processing technique, and then frequent itemsets are mined from the clustered data using the frequent itemset mining algorithms Apriori and Eclat.
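As a point of reference for the two mining algorithms combined in this architecture, the following is a minimal single-machine Python sketch of Apriori (horizontal layout, level-wise candidate generation) and Eclat (vertical layout, TID list intersection) as described in Section 3. It is a teaching sketch rather than the authors' code; the function names, the toy data and the convention that an itemset is frequent when its support is greater than or equal to `min_sup` are assumptions of this example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # Level 1: count single items with one scan of the horizontal database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets, then prune candidates that have an
        # infrequent (k-1)-subset (monotonicity property).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One more database scan to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result

def eclat(transactions, min_sup):
    # Build the vertical database: item -> set of transaction ids.
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    result = {}

    def extend(prefix, items):
        # items: (item, tids) pairs that are frequent together with prefix.
        for i, (item, tids) in enumerate(items):
            itemset = prefix | {item}
            result[itemset] = len(tids)
            suffix = []
            for other, other_tids in items[i + 1:]:
                shared = tids & other_tids  # TID list intersection
                if len(shared) >= min_sup:
                    suffix.append((other, shared))
            extend(itemset, suffix)

    start = sorted(
        ((i, t) for i, t in tidlists.items() if len(t) >= min_sup),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return result

# Hypothetical toy database of four transactions.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
apriori_result = apriori(transactions, 2)
eclat_result = eclat(transactions, 2)
print(apriori_result == eclat_result)
```

Both functions return the same itemsets with the same supports; they differ in how the database is accessed, which is the trade-off between repeated scans and large TID lists noted in Section 3.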

4.2. ClustBigFIM on MapReduce

The ClustBigFIM algorithm has the phases below:
a. Find Clusters
b. Finding k-FIs
c. Generate single global TID list
d. Mining of subtrees

4.2.1. Find Clusters

The k-means clustering algorithm is used for finding clusters from the given large dataset. Clusters of transactions are formed based on the formula below, which calculates the minimum squared error,

$J(C_z) = \sum_{t_i \in C_z} \| t_i - \mu_z \|^2$

and each transaction is assigned to a cluster. The input to this phase is the transaction dataset and the number of clusters; clusters of transactions are generated like $C = \{t_1, t_{10}, \ldots\}$.

Input : Cluster size and dataset
Output : Clusters with size z
Steps :
1. Find the distance between the centres and the transaction IDs in the map phase.
2. Use the combiner function to combine the results of the above step.
3. Compute the MSE using the formulas below and assign all points to clusters in the reduce phase,
   $J(C_z) = \sum_{t_i \in C_z} \| t_i - \mu_z \|^2$
   $J(C) = \sum_{z=1}^{S} \sum_{t_i \in C_z} \| t_i - \mu_z \|^2$
4. Repeat steps 1-3 by changing the centres, and stop when the convergence criterion is reached.

4.2.2. Finding k-FIs

Transaction ID lists for large datasets cannot be handled by the Eclat algorithm, so frequent itemsets of size k are mined from the clusters generated in the above phase using the Apriori algorithm based on the minimum support condition, which handles the problem of large datasets. A prefix tree is generated using the frequent itemsets.

Input : Cluster size, minimum threshold σ, prefix length (l)
Output : Prefixes with length l and k-FIs
Steps :
5. Find the support of all items in a cluster using the Apriori algorithm.
6. Apply $Support(x_i) > \sigma$ and calculate the FIs using the monotonic property.
7. Repeat steps 5-6 until all k-FIs are calculated, using mappers and reducers.
8. Repeat steps 5-7 for clusters (1 to S) and find the final k-FIs.
9. Keep the created prefixes in lexicographic order using a lexicographic prefix tree.

4.2.3. Generate single global TID list

The Eclat algorithm uses a vertical database, i.e. items together with the list of transactions in which each item is present. The global TID list is generated by combining local TID lists using mappers and reducers. The generated TID list is used in the next phase.

Input : Prefix tree, minimum support σ
Output : Single TID list of all items
Steps :
10. Calculate the TID lists using the prefix tree in the map phase.
11. Create a single TID list from the TID lists generated in the above step. Perform pruning with $support(i_a) \le support(i_b) \Leftrightarrow a < b$.
12. Generate prefix groups, $P_k = (P_k^1, P_k^2, \ldots, P_k^n)$.

4.2.4. Mining of Subtrees

Next, the (k+1)-FIs are mined using the Eclat algorithm. The prefix tree generated in phase 2 is mined independently by the mappers, and the frequent itemsets are generated.

Input : Prefix tree, minimum support σ
Output : k-FIs
Steps :
13. Apply the Eclat algorithm and find FIs till size k.
14. Repeat step 13 for each subtree in the map phase.
15. Find all frequent itemsets of size k and store them in compressed trie format.
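The interplay of the last three phases can be sketched on a single machine as follows. This Python example is a simplified illustration under several assumptions: it works on one cluster only, it builds the TID lists directly instead of merging local lists from mappers, it finds the k-FI prefixes by brute force in place of the level-wise Apriori passes, and all function names are hypothetical. An itemset is treated as frequent when its support is greater than or equal to `min_sup`.

```python
from itertools import combinations

def tidlists_of(transactions):
    # Vertical database: item -> set of transaction ids.
    lists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            lists.setdefault(item, set()).add(tid)
    return lists

def k_prefixes(tidlists, k, min_sup):
    # Stand-in for the "Finding k-FIs" phase: frequent k-itemsets, kept as
    # lexicographically sorted tuples together with their tid sets.
    items = sorted(i for i, t in tidlists.items() if len(t) >= min_sup)
    prefixes = {}
    for combo in combinations(items, k):
        tids = set.intersection(*(tidlists[i] for i in combo))
        if len(tids) >= min_sup:
            prefixes[combo] = tids
    return prefixes

def mine_subtree(prefix, prefix_tids, tidlists, min_sup):
    # Eclat-style depth-first extension of one prefix. Each call needs only
    # its own prefix, so calls could run in separate mappers.
    found = {}

    def grow(itemset, tids, candidates):
        for idx, item in enumerate(candidates):
            new_tids = tids & tidlists[item]
            if len(new_tids) >= min_sup:
                new_set = itemset + (item,)
                found[new_set] = len(new_tids)
                grow(new_set, new_tids, candidates[idx + 1:])

    # Extend only with items that sort after the last prefix item, so every
    # itemset is generated from exactly one prefix.
    later = sorted(
        i for i, t in tidlists.items() if i > prefix[-1] and len(t) >= min_sup
    )
    grow(prefix, prefix_tids, later)
    return found

def mine(transactions, k, min_sup):
    tidlists = tidlists_of(transactions)
    prefixes = k_prefixes(tidlists, k, min_sup)
    result = {p: len(t) for p, t in prefixes.items()}
    for p, t in prefixes.items():  # each iteration models one mapper
        result.update(mine_subtree(p, t, tidlists, min_sup))
    return result

# Hypothetical toy cluster of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
itemsets = mine(transactions, k=2, min_sup=2)
print(itemsets)
```

The result contains the frequent itemsets of size k and larger; the prefix length k controls how much work is done breadth-first before the independent depth-first subtree mining starts.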

5. EXPERIMENTS

This section gives an overview of the datasets used and the experimental results with comparative analysis. For the experiments, 2 machines are used. Each machine contains an Intel Core i5-3230M processing unit and 6.00 GB RAM, with Ubuntu and Hadoop. Currently the algorithm runs on a single pseudo-distributed Hadoop cluster. Datasets from the standard UCI repository and the FIMI repository are used in order to compare the results with existing systems such as Dist-Eclat and BigFIM.

5.1. Dataset Information

Experiments are performed on the datasets below:
Mushroom - Provided by the FIMI repository [22]; has 119 items and 8,124 transactions.
T10I4D100K - Provided by the UCI repository [23]; has 870 items and 100,000 transactions.
Retail - Provided by the UCI repository [23].
Pumsb - Provided by the FIMI repository [22]; has 49,046 transactions.

5.2. Results Analysis

Experiments are performed on the T10I4D100K, Retail, Mushroom and Pumsb datasets, and the execution time required for generating k-FIs is compared based on the number of mappers and the Minimum Support. Results show that Dist-Eclat is faster than the BigFIM and ClustBigFIM algorithms on T10I4D100K, but the Dist-Eclat algorithm does not work on large datasets such as Pumsb. Dist-Eclat is not scalable enough and faces memory problems as the dataset size increases.

Experiments are performed on the T10I4D100K dataset in order to compare the execution time of Dist-Eclat, BigFIM and ClustBigFIM with different Minimum Support values and numbers of mappers. Table 1 shows the execution time (sec) for the T10I4D100K dataset with different values of Minimum Support and 6 mappers. Figure 3 shows the timing comparison of the methods on the T10I4D100K dataset; Dist-Eclat is faster than the BigFIM and ClustBigFIM algorithms. Execution time decreases as the Minimum Support value increases, which shows the effect of Minimum Support on execution time.

Table 2 shows the execution time (sec) for the T10I4D100K dataset with different numbers of mappers and Minimum Support 100. Figure 4 shows the timing comparison of the methods on the T10I4D100K dataset; again Dist-Eclat is faster than the BigFIM and ClustBigFIM algorithms. Execution time increases as the number of mappers increases, because the communication cost between mappers and reducers increases.

Table 1. Execution Time (Sec) for T10I4D100K with different Support.
Dataset: T10I4D100K. Columns: Min. Support, Dist-Eclat, BigFIM, ClustBigFIM. No. of Mappers: 6.

Table 2. Execution Time (Sec) for T10I4D100K with different No. of Mappers.
Dataset: T10I4D100K. Columns: Number of Mappers, Dist-Eclat, BigFIM, ClustBigFIM. Minimum Support: 100.

Figure 3. Timing comparison for various methods and Minimum Support on T10I4D100K

Figure 4. Timing comparison for different methods and No. of Mappers on T10I4D100K

Results show that the ClustBigFIM algorithm works on Big Data. Experiments are performed on the Pumsb dataset. The Dist-Eclat algorithm faced memory problems with the Pumsb dataset, so the results of ClustBigFIM are compared with the BigFIM algorithm, which is scalable. Table 3 and Table 4 show the execution time taken by the BigFIM and ClustBigFIM algorithms on the Pumsb dataset with variable Minimum Support and number of mappers. In these experiments the number of mappers is fixed at 20 (Table 3) and the Minimum Support is fixed (Table 4). Figure 3, Figure 5 and Figure 6 show that the ClustBigFIM algorithm performs better than the BigFIM algorithm due to pre-processing.

Table 3. Execution Time (Sec) for Pumsb with different Support.
Dataset: Pumsb. Columns: Min. Support, BigFIM, ClustBigFIM. No. of Mappers: 20.

Table 4. Execution Time (Sec) for Pumsb with different No. of Mappers.
Dataset: Pumsb. Columns: Number of Mappers, BigFIM, ClustBigFIM. Minimum Support held fixed.

Figure 5. Timing comparison for different methods and Minimum Support on Pumsb

Figure 6. Timing comparison for different methods and No. of Mappers on Pumsb

6. CONCLUSIONS

In this paper we implemented an FIM algorithm based on the MapReduce programming model. The k-means clustering algorithm is used for pre-processing, frequent itemsets of size k are mined using the Apriori algorithm, and the discovered frequent itemsets are mined further using the Eclat algorithm. ClustBigFIM works on large datasets with increased execution efficiency due to pre-processing. Experiments are done on transactional datasets; the results show that ClustBigFIM works on Big Data efficiently and with higher speed than BigFIM. We are planning to run the ClustBigFIM algorithm on different datasets for further comparative analysis.

REFERENCES

[1] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM 39, 11 (November 1996).
[2] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD Rec. 22, 2 (June 1993).
[3] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Min. and Knowl. Disc.
[4] G. A. Andrews. Foundations of Multithreaded, Parallel, and Distributed Programming. Addison-Wesley.
[5] J. Li, Y. Liu, W. k. Liao, and A. Choudhary. Parallel data mining algorithms for association rules and clustering. In Intl. Conf. on Management of Data.
[6] E. Ozkural, B. Ucar, and C. Aykanat. Parallel frequent item set mining with selective item replication. IEEE Trans. Parallel Distrib. Syst.
[7] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, pages 14-25.
[8] L. Zeng, L. Li, L. Duan, K. Lu, Z. Shi, M. Wang, W. Wu, and P. Luo. Distributed data mining: a survey. Information Technology and Management.
[9] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., pages 1-12.

[10] L. Liu, E. Li, Y. Zhang, and Z. Tang. Optimization of frequent itemset mining on multiple-core processor. In Proceedings of the 33rd international conference on Very large data bases, VLDB 07. VLDB Endowment.
[11] M.-Y. Lin, P.-Y. Lee and S.C. Hsueh. Apriori-based frequent itemset mining algorithms on MapReduce. In Proc. ICUIMC. ACM.
[12] N. Li, L. Zeng, Q. He, and Z. Shi. Parallel implementation of Apriori algorithm based on MapReduce. In Proc. SNPD.
[13] S. Hammoud. MapReduce Network Enabled Algorithms for Classification Based on Association Rules. Thesis.
[14] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng. Balanced parallel FP-Growth with MapReduce. In Proc. YC-ICT.
[15] Sheng-Hui Liu; Shi-Jia Liu; Shi-Xuan Chen; Kun-Ming Yu, "IOMRA - A High Efficiency Frequent Itemset Mining Algorithm Based on the MapReduce Computation Model," Computational Science and Engineering (CSE), 2014 IEEE 17th International Conference on, pp. 1290-1295, Dec.
[16] Moens, S.; Aksehirli, E.; Goethals, B., "Frequent Itemset Mining for Big Data," Big Data, 2013 IEEE International Conference on, pp. 111-118, 6-9 Oct.
[17] M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal. PARMA: a parallel randomized algorithm for approximate association rule mining in MapReduce. In Proc. CIKM. ACM.
[18] M. Malek and H. Kadima. Searching frequent itemsets by clustering data: towards a parallel approach using mapreduce. In Proc. WISE 2011 and 2012 Workshops. Springer Berlin Heidelberg.
[19] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. VLDB.
[20] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Min. and Knowl. Disc.
[21] A K Jain, M N Murty, P. J. Flynn, Data Clustering: A Review, ACM COMPUTING SURVEYS.
[22] Frequent itemset mining dataset repository.
[23] T. De Bie. An information theoretic framework for data mining. In Proc. ACM SIGKDD.