Searching frequent itemsets by clustering data

Towards a parallel approach using MapReduce
Maria Malek, Hubert Kadima
LARIS-EISTI, Ave du Parc, 95011 Cergy-Pontoise, FRANCE
maria.malek@eisti.fr, hubert.kadima@eisti.fr

Outline: Introduction and Related Work; Algorithm parameters, structures and definition; Algorithm description: sequential version; K-means implementation; Apriori implementation on MapReduce; Our proposal.

Introduction
Association rule mining finds relationships between items in large databases of transactions. The frequent itemset problem was introduced by Agrawal in 1993. The Apriori algorithm is based on the downward closure property: if an itemset is not frequent, no superset of it can be frequent. Apriori performs a breadth-first search of the search space, generating candidates of length k+1 from the frequent k-itemsets.

Frequent itemsets: example

Transactions:
  100: 1 3 4
  200: 2 3 5
  300: 1 2 3 5
  400: 2 5

Results, with minsupp = 2: frequent itemsets L1 = {1, 2, 3, 5}, L2 = {13, 23, 25, 35}, L3 = {235}.
Association rules: R1: 2 -> 35 with conf = 2/3, R2: 3 -> 5 with conf = 2/3, etc.
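To make the breadth-first search concrete, here is a minimal Python sketch of Apriori run on the four example transactions above (the function and variable names are ours, chosen for illustration; minsupp is an absolute count, as in the example):

  from itertools import combinations

  def apriori(transactions, minsupp):
      # Breadth-first search: frequent (k+1)-itemsets are generated from frequent k-itemsets.
      transactions = [frozenset(t) for t in transactions]
      items = sorted({i for t in transactions for i in t})

      def support(itemset):
          return sum(1 for t in transactions if itemset <= t)

      # L1: frequent 1-itemsets
      levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsupp}]
      while levels[-1]:
          # Join step: candidates of length k+1 from pairs of frequent k-itemsets
          candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == len(a) + 1}
          # Prune step (downward closure): every k-subset of a candidate must be frequent
          candidates = {c for c in candidates
                        if all(frozenset(s) in levels[-1] for s in combinations(c, len(c) - 1))}
          levels.append({c for c in candidates if support(c) >= minsupp})
      return [lv for lv in levels if lv]

  # Example transaction base: 100 = {1,3,4}, 200 = {2,3,5}, 300 = {1,2,3,5}, 400 = {2,5}
  for k, lv in enumerate(apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsupp=2), 1):
      print("L%d:" % k, sorted(sorted(s) for s in lv))
  # Prints L1 = {1,2,3,5}, L2 = {13,23,25,35}, L3 = {235}, matching the results above.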

Downward closure property
Property: let X_k be a frequent itemset; then every itemset included in X_k is also frequent.
Example: if ABCD is a frequent itemset, then ABC, ABD, ACD, BCD, AB, AC, AD, BC, BD, CD, A, B, C, D are all frequent itemsets.

Related work
The Apriori algorithm relies on storing the database in memory. When the dataset is huge, both memory use and computational cost remain expensive. One idea is to use a dedicated data structure (such as trees or intervals) to obtain a condensed representation of the transactions. Faster alternatives to Apriori include FP-Growth (J. Han, J. Pei, and Y. Yin, 2000), a depth-first implementation of Apriori (W. A. Kosters and W. Pijls, 2003), and a massively parallel FP-Growth algorithm implemented with the MapReduce framework (H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, 2008).

Algorithm idea
The idea is to start the search from a set of representative examples instead of testing the 1-itemsets, then the 2-itemsets, and so on. A clustering algorithm is first applied to partition the transactions into k clusters; each cluster is represented by its most representative example. We currently use the k-medoids algorithm to cluster the transactions. The set of k representative examples is then used as the starting point for the frequent-itemset search.

K-means algorithm: illustration (figure)

Algorithm parameters
Input: a set of transactions called D.
Output: a list named accepted that contains the retained frequent itemsets, and a list named excluded that contains the retained non-frequent itemsets.
Parameters: K is the initial number of clusters; minsupp is the threshold used for computing frequent itemsets.

Algorithm structures and definitions
Structure: an intermediate list, called candidates, containing the itemsets to test; itemsets are sorted by decreasing length.
Definitions:
Global frequent itemset. Let D be a set of transactions and let L_i be an itemset of length i; L_i is a global frequent itemset iff it is frequent in D.
Local frequent itemset. Let D be a set of transactions segmented into k disjoint clusters and let L_i be an itemset of length i; L_i is a local frequent itemset iff it is frequent in the cluster to which it belongs.

Algorithm description (1)
1. Apply k-medoids to D (the transaction base) and store the k representative examples as well as the k clusters.
2. Let C_1, C_2, ..., C_k be the k representative examples sorted by decreasing length in D. The candidates list is initialized to C_1, C_2, ..., C_k.
3. While candidates is not empty, let C_i be the first element of candidates. If C_i is neither in accepted nor in excluded, then:
   1. if C_i is a local frequent itemset, then update-accepted(C_i), exit;
   2. if C_i is a global frequent itemset, then update-accepted(C_i), exit;
   3. otherwise, update-excluded(C_i) and add the itemsets included in C_i to the candidates list.

Algorithm description (2)
The algorithm first tests whether a given example e is a local frequent itemset; if yes, the accepted list is updated. Otherwise it tests whether e is a global frequent itemset; if yes, the accepted list is updated, and otherwise the excluded list is updated.
update-accepted(C_i): add to the accepted list the itemset C_i and all the itemsets included in it.
update-excluded(C_i): add to the excluded list the itemset C_i and all the itemsets that include it.
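The following Python sketch puts the sequential description together. The clusters, their representatives, and an absolute minsupp are assumed to be given by k-medoids; details such as how sub-itemsets inherit the cluster of their parent representative are our own simplifications, not the authors' implementation:

  from itertools import combinations

  def mine_from_representatives(clusters, representatives, minsupp):
      # clusters: list of lists of transactions (sets); representatives[i] belongs to clusters[i].
      all_transactions = [t for cluster in clusters for t in cluster]

      def support(itemset, transactions):
          return sum(1 for t in transactions if itemset <= t)

      def is_local_frequent(itemset, cluster_id):
          return support(itemset, clusters[cluster_id]) >= minsupp

      def is_global_frequent(itemset):
          return support(itemset, all_transactions) >= minsupp

      accepted, excluded = set(), set()
      owner = {frozenset(r): i for i, r in enumerate(representatives)}
      # candidates: the representative examples sorted by decreasing length
      candidates = sorted(owner, key=len, reverse=True)

      while candidates:
          c = candidates.pop(0)
          if c in accepted or c in excluded:
              continue
          if is_local_frequent(c, owner[c]) or is_global_frequent(c):
              # update-accepted: c and every itemset included in it are frequent
              accepted.update(frozenset(s) for n in range(1, len(c) + 1)
                              for s in combinations(c, n))
          else:
              # update-excluded: c is not frequent; test its immediate sub-itemsets next
              excluded.add(c)
              if len(c) > 1:
                  subs = [frozenset(s) for s in combinations(c, len(c) - 1)]
                  for s in subs:
                      owner.setdefault(s, owner[c])
                  candidates = sorted(set(candidates) | set(subs), key=len, reverse=True)
      return accepted, excluded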

Experiments (1): data set
The data set consists of navigation logs extracted from the Microsoft web site (UCI Machine Learning Repository). The site is composed of 17 pages with links between them, which we represent by the set of characters {A, B, C, ..., P, Q}. The initial log file contained the navigation paths of 388445 users. Keeping only users with sufficiently long paths reduces the number of users to 36014. We call this set of transactions D.

Experiments (2): k-medoids clustering results

k     E1 (|C1|)    E2 (|C2|)    E3 (|C3|)    E4 (|C4|)    E5 (|C5|)
k=2   ABFG  55%    ABDLK 45%
k=3   ABFFG 46%    ABDKL 23%    ABCF  31%
k=4   AFG   35%    ABDKL 34%    ABCFJ 17%    ABDFG 14%
k=5   AFGJ  11%    ABDKL 42%    ABCFJ 19%    ABDFG 23%    BDFGN  5%

Table: results of applying k-medoids on the transaction base D. For each value of k we report the representative example Ei of each cluster Ci as well as the cardinality of the cluster (as a percentage of D).

Experiments (2, continued): frequent itemsets found

Itemset   Support   Found by the novel algorithm
AFGJ      2406      yes
ABCJL     2406      no
ABCFJ     1576      yes
ABCKL     1922      no
ABDFG     2628      yes
ABDGL     1813      no
ABDKL     1735      yes
ABFGJ     1834      no

Table: all frequent itemsets whose support is 1576 or higher and whose length is 4 or 5. Four of the eight itemsets have been found by the novel algorithm.

Experiments (3)
Table (data not reproduced here): the number of transactions in each cluster when k=4; for each k, the frequency rate of the found representative example; and the local and global supports of the representative examples when k=4.

Experiments (4)
Figure: the red curve shows the number of frequent itemsets found locally, the green curve the number of frequent itemsets found globally, and the orange curve the total number of frequent itemsets found.

Experiments (5)
Figure: evolution of the execution time (in microseconds) as a function of k.

MapReduce
MapReduce is a framework for parallel and distributed computing introduced by Google to process huge data sets on a large number of machines (nodes).
Map step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its sub-problem and passes the answer back to the master node as a list of key-value pairs.

Reduce step: the master node then collects the answers to all the sub-problems and, for each key, combines the intermediate values computed by the different mappers to form the output, i.e. the answer to the original problem.
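As a concrete illustration of these two steps, here is a tiny in-memory simulation of the map/group/reduce flow (a didactic Python sketch, not tied to Hadoop or any particular framework); the example counts candidate itemsets over two transaction splits:

  from collections import defaultdict

  def run_mapreduce(splits, mapper, reducer):
      # Map step: every split is handed to a mapper, which emits (key, value) pairs.
      intermediate = defaultdict(list)
      for split in splits:
          for key, value in mapper(split):
              intermediate[key].append(value)
      # Reduce step: for each key, combine the intermediate values from all mappers.
      return {key: reducer(key, values) for key, values in intermediate.items()}

  splits = [[{1, 3, 4}, {2, 3, 5}], [{1, 2, 3, 5}, {2, 5}]]
  candidates = [frozenset({2, 5}), frozenset({1, 3})]

  def count_mapper(split):
      for transaction in split:
          for itemset in candidates:
              if itemset <= transaction:
                  yield itemset, 1

  def sum_reducer(key, values):
      return sum(values)

  print(run_mapreduce(splits, count_mapper, sum_reducer))
  # {frozenset({2, 5}): 3, frozenset({1, 3}): 2}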

MapReduce: schema (figure)

K-means implementation: scheme (figure; source: horkicky.blogspot.com)

K-means implementation: description (1)
Master job:
1. Divide the input data into smaller subsets and distribute them to the mapper nodes.
2. Randomly choose a list of k representative examples and send it to all mappers.
3. Launch one MapReduce job per iteration until the algorithm converges (until the representative examples stabilize).

K-means implementation: description (2)
Master job: launch one MapReduce job per iteration until the algorithm converges (until the representative examples stabilize).
MapReduce:
1. Mapper function: for each example, compute the closest representative example and assign the example to the associated cluster.
2. Reducer function: for each representative example, collect the partial sums of the computed distances from all mappers, then recompute the list of k new representative examples.
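A minimal Python sketch of one such MapReduce iteration for k-means on numeric data is given below. Where the slide speaks of partial sums of distances, this sketch follows the common formulation in which the reducer sums point coordinates and counts per cluster to recompute the centroids; this is our simplification, not the authors' code:

  import numpy as np
  from collections import defaultdict

  def kmeans_mapper(split, centroids):
      # Mapper: assign each point of this split to its closest centroid and
      # emit (centroid_index, (point, 1)).
      for p in split:
          idx = int(np.argmin([np.linalg.norm(p - c) for c in centroids]))
          yield idx, (p, 1)

  def kmeans_reducer(grouped):
      # Reducer: combine the partial (point, count) contributions per centroid
      # and return the recomputed centroids.
      return {idx: sum(p for p, _ in pairs) / sum(n for _, n in pairs)
              for idx, pairs in grouped.items()}

  def kmeans_iteration(splits, centroids):
      # One MapReduce job: map all splits, group by centroid index, reduce.
      grouped = defaultdict(list)
      for split in splits:
          for key, value in kmeans_mapper(split, centroids):
              grouped[key].append(value)
      new = kmeans_reducer(grouped)
      return [new.get(i, c) for i, c in enumerate(centroids)]

  # The master would repeat kmeans_iteration until the centroids stabilize.
  splits = [np.array([[1.0, 1.0], [1.5, 2.0]]), np.array([[8.0, 8.0], [9.0, 7.5]])]
  centroids = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
  print(kmeans_iteration(splits, centroids))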

K-medoids implementation: adaptation
Master job: launch one MapReduce job per iteration until the algorithm converges (until the representative examples stabilize).
MapReduce:
1. Mapper function: for each example, compute the closest representative example and assign the example to the associated cluster.
2. Reducer function: for each (cluster, example) pair, collect the partial sums of the computed distances from all mappers, then recompute the list of k new representative examples.

Apriori implementation on MapReduce (1)

Map
  void map(void *map_data)
    for each transaction in map_data
      for (i = 0; i < candidates_size; i++)
        itemset = candidates[i]
        match = itemset_exists(transaction, itemset)
        if (match == true)
          emit_intermediate(itemset, one)

Reduce
  void reduce(void *key, void **vals, int vals_length)
    count = 0
    for (i = 0; i < vals_length; i++)
      count += *vals[i]
    if (count >= (support_level * num_transactions) / 100.0)
      emit(key, count)

Apriori implementation on MapReduce (2)

Updating the candidates list
  void update_frequent_candidates(void *reduce_data_out)
    length = reduce_data_out_length
    for (j = 0; j < length; j++)
      temp_candidates[j] = reduce_data_out[j].key
    candidates = temp_candidates

Our proposal: parallel implementation of our algorithm
1. Apply the parallel version of the k-medoids algorithm described above in order to obtain the data set segmented into k clusters.
2. Initialize the list of candidates to the k representative examples.
3. Distribute the k obtained clusters over k mappers.
4. Repeat:
   1. Send the list of candidates to the k mappers.
   2. Each mapper computes the local support of each candidate.
   3. The reducer collects all local supports for each candidate and, when necessary, computes the global supports.
   4. The master updates the accepted, excluded and candidates lists.
   Until the list of candidates is empty.
A sketch of this master loop is given below.
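The following Python sketch illustrates the structure of that master loop, with each cluster standing in for one mapper. Here an itemset is treated as locally frequent if it reaches minsupp in any cluster, a simplification of the "frequent in its own cluster" criterion, and minsupp is an absolute count; this is an illustration only, not the authors' implementation:

  from itertools import combinations

  def parallel_frequent_itemsets(clusters, representatives, minsupp):
      accepted, excluded = set(), set()
      candidates = sorted({frozenset(r) for r in representatives}, key=len, reverse=True)

      def map_local_supports(candidate_list):
          # "Map": each cluster/mapper counts every candidate it was sent.
          return [[sum(1 for t in cluster if c <= t) for c in candidate_list]
                  for cluster in clusters]

      while candidates:
          supports = map_local_supports(candidates)   # one row of counts per mapper
          next_candidates = set()
          for j, c in enumerate(candidates):
              if c in accepted or c in excluded:
                  continue
              locally_frequent = any(row[j] >= minsupp for row in supports)
              # "Reduce": the global support is the sum of the local supports.
              globally_frequent = sum(row[j] for row in supports) >= minsupp
              if locally_frequent or globally_frequent:
                  # update-accepted: c and all the itemsets included in it
                  accepted.update(frozenset(s) for n in range(1, len(c) + 1)
                                  for s in combinations(c, n))
              else:
                  # update-excluded: c is not frequent; try its immediate sub-itemsets
                  excluded.add(c)
                  if len(c) > 1:
                      next_candidates.update(frozenset(s) for s in combinations(c, len(c) - 1))
          candidates = sorted(next_candidates - accepted - excluded, key=len, reverse=True)
      return accepted, excluded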

Conclusion
We have presented a new algorithm for searching frequent itemsets in large databases. The idea is to start the search from a set of representative examples instead of testing the 1-itemsets, the 2-itemsets, and so on. The k-medoids clustering algorithm is first applied to partition the transactions into k clusters, and each cluster is represented by its most representative example. Experimental results show that starting the search from these representative examples allows a significant number of frequent itemsets to be found locally.

Perspectives
Extend the algorithm so that it finds all frequent itemsets. We have proposed a parallel version of this algorithm based on the MapReduce framework; we are now implementing it and will compare its performance with other parallel implementations such as the massively parallel FP-Growth algorithm.

Bibliography I
Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487-499, 1994.
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-Reduce for machine learning on multicore. In NIPS, pages 281-288, 2006.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2):1-12, May 2000.
Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. A Map-Reduce system with an alternate API for multi-core environments. In CCGRID, pages 84-93, 2010.
L. Kaufman and P. J. Rousseeuw. Clustering by means of medoids. In Y. Dodge, editor, Statistical Data Analysis Based on the L1-Norm and Related Methods, pages 405-416. North-Holland, 1987.

Bibliography II
Walter A. Kosters and Wim Pijls. Apriori, a depth first implementation. In Proc. of the Workshop on Frequent Itemset Mining Implementations, 2003.
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang. PFP: parallel FP-Growth for query recommendation. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys '08, pages 107-114, New York, NY, USA, 2008. ACM.
Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning Publications, 1st edition, January 2011.
Mingjun Song and Sanguthevar Rajasekaran. A transaction mapping algorithm for frequent itemsets mining. IEEE Transactions on Knowledge and Data Engineering, 18:472-481, 2006.
Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In CloudCom, pages 674-679, 2009.