Searching frequent itemsets by clustering data
|
|
|
- Emery Fitzgerald
- 10 years ago
- Views:
Transcription
1 Towards a parallel approach using MapReduce Maria Malek Hubert Kadima LARIS-EISTI Ave du Parc, Cergy-Pontoise, FRANCE [email protected], [email protected]
2 1 Introduction and Related Work 2 Algorithm parameters, structures and definition Algorithm description: sequential version 3 4 K-means implementation Apriori implementation on MapReduce Our proposal 5
3 Introduction Applying association rules mining leads to find relationships between items in large data bases that contain transactions. The problem of frequent itemsets has been introduced by Agrawal in This Apriori algorithm is based on the downward closure propriety: if an itemset is not frequent, any superset of it will not be frequent. The Apriori algorithm performs a breadth-first search in the search space by generating candidates of length K + 1 from frequent k-itemsets.
4 Frequent itemsets - example Transactions Results If minsupp=2 : frequent itemsets : L 1 = {1, 2, 3, 5}, L 2 = {13, 23, 25, 35},L 3 = {235}. Association Rules : R 1 : 2 35 avec conf = 2 3, R 2 : 3 5 avec conf = 2 3, etc.
5 Downward closure propriety Propriety Let X k be a frequent itemset, all frequent itemsets included in X k are frequent. 1 If ABCD is a frequent itemset, 2 then ABC,ABD, BCD, AB,AC,BC,BD,CD,A,B,C,D are frequent itemsets.
6 Related work The Apriori algorithm is based on storing database in the memory. When the dataset size is huge, both the memory use and the computational cost still be expensive. The idea is to use a given data structure to achieve a condensed representation of the data transactions (like trees or intervals). Other algorithms (fasters than Apriori): FPGrowth(J.Han, J. Pei, and Y. Yin -2000), Depth first implementation of Apriori (W. A. Kosters and W. Pijls -2003), A massively parallel FP-Growth algorithm implemented with the MapReduce framework (H. Li, Y. Wang, D. Zhang, M. Zhang, and E.Y. Chang-2008).
7 Algorithm idea Algorithm parameters, structures and definition Algorithm description: sequential version The idea is to start searching from a set of representative examples instead of testing the 1-itemset,the k-itemset and so on. A clustering algorithm is firstly applied in order to cluster the transactions into k clusters. Each cluster is represented by the most representative example. We currently use the k-medoids algorithm in order to cluster the transactions. The set of the k representative examples will be used as the starting point for searching frequent itemsets.
8 K-means algorithm: illustration Algorithm parameters, structures and definition Algorithm description: sequential version
9 Algorithm parameters Algorithm parameters, structures and definition Algorithm description: sequential version Input and outpout Input : a set of transactions called D, Output : A list named accepted that contains the retained frequent itemsets. A list named excluded that contains the retained no-frequent itemsets. Parameters K is the initial number of clusters. minsupp is the threshold used for computing frequent itemsets.
10 Algorithm structures and definition Algorithm parameters, structures and definition Algorithm description: sequential version Structure An intermediate list that we call candidates containing the itemsets to test. Itemsets will be sorted by their decreasing lengths. Definition Global frequent itemsets Let D be a set of transactions. Let L i be an i-itemset of length i, L i is a global frequent itemset iff it is frequent in D. Local frequent itemsets Let D be a set of transaction segmented on k disjoints clusters. Let L i be an i-itemset of length i, L i is a local frequent itemset iff it is frequent in the cluster to whose it belongs.
11 Algorithm description-1 Algorithm parameters, structures and definition Algorithm description: sequential version 1 Apply the k-medoids on D (the transactions base) and stock the k representative examples as well as the K clusters. 2 Let C 1, C 2,.., C k be the k representative examples sorted by their decreasing lengths, in D. The list candidates is initialized to C 1, C 2,.., C k. 3 While the list candidates Φ do Let C i be the first element of candidates: If C i accepted et C i excluded then 1 If C i is a local frequent itemset then update-accepted(c i ), exit. 2 If C i is a global frequent itemset then update-accepeted(c i ), exit. 3 else, update-excluded (C i ), and add frequent itemsets included in C i to the candidates list.
12 Algorithm description-2 Algorithm parameters, structures and definition Algorithm description: sequential version The algorithm tests firstly if a given example e is a local frequent itemset, if yes the list called accepted is updated, otherwise the algorithm tests if e is a global frequent itemsest, if yes the list accepted is updated, otherwise the list excluded is updated. update-accepted(c i ): Add to the list accepted, the itemset C i and all the itemsets included in it. update-excluded(c i ): Add to the list excluded, the itemset C i and all the itemsets that include C i.
13 -1 Data set is composed of a set of navigation logs extracted from the Microsoft site (UCI-machine learning repository) The site is composed of 17 pages with some links between each others that we present by the set of the characters : {A, B, C,.., P, Q}. Initial data logs file contained navigations paths of users. By keeping only users who have sufficient paths length the users number is reduced to We call this set of transactions D.
14 -2 K C 1 C 2 C 3 C 4 C 5 E 1 C 1 E 2 C 2 E 3 C 3 E 4 C 4 E 5 C 5 k=2 ABFG 55% ABDLK 45% k=3 ABFFG 46% ABDKL 23% ABCF 31% k=4 AFG 35% ABDKL 34% ABCFJ 17% ABDFG 14% k=5 AFGJ 11% ABDKL 42% ABCFJ 19% ABDFG 23% BDFGN 5% Table: Results of applying k-medoids on the transactions base D, we report for each value of k the representative example Ei of each cluster Ci as well as the cardinality of the cluster.
15 -2 Itemset support found by the novel algorithm AFGJ 2406 yes ABCJL 2406 no ABCFJ 1576 yes ABCKL 1922 no ABDFG 2628 yes ABDGL 1813 no ABDKL 1735 yes ABFGJ 1834 no Table: This tables shows all frequent itemsets whose supports are higher than 1576 and whose length is 4 or 5. Four of the eight itemsets have been found by the novel algorithm.
16 -3 Table: The number of transactions in each cluster when k=4, and for each k the frequency rate of the found representative example. The local and global supports for representative examples when k=4
17 -4 Figure: The red curve shows the number of frequent itemsets found locally, the green one shows the number of frequent itemsets found globally, and the orange one shows all the found frequent itemsets.
18 -5 Figure: This figure shows the execution time evolution(micro seconds) in function of K.
19 MapReduce K-means implementation Apriori implementation on MapReduce Our proposal MapReduce A Framework for parallel and distributed computing that has been introduced by Google in order to handle huge data sets using a large number of computers (nodes). Map Step The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node in the form of list of key-values couples.
20 MapReduce K-means implementation Apriori implementation on MapReduce Our proposal Map Step The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node in the form of list of key-values couples. Reduce Step The master node then collects the answers to all the sub-problems, and combines for a given key the intermediates values computed by the different mappers in order to form the output or the answer to the problem.
21 MapReduce : Schema K-means implementation Apriori implementation on MapReduce Our proposal
22 K-means implementation: scheme K-means implementation Apriori implementation on MapReduce Our proposal Source : horkicky.blogspot.com
23 K-means implementation Apriori implementation on MapReduce Our proposal K-means implementation: description-1 Master Job Divide the input data into smaller sub-sets, and distribute them to mapper nodes. Randomly choose a list containing k representative examples and sent them to all mappers. Launch a MapReduce job for each iteration until the algorithm convergence (until stabilization of representative examples).
24 K-means implementation Apriori implementation on MapReduce Our proposal K-means implementation: description-2 Master Job Launch a MapReduce job for each iteration until the algorithm convergence (until stabilization of representative examples). MapReduce 1 Mapper function: compute for each example the closer representative example, and assign it to the associated cluster. 2 Reducer function: collect for each representative example the partial sum of the computed distances from all mappers, and then re-compute the k new representative examples list.
25 K-medoids implementation: adaptation K-means implementation Apriori implementation on MapReduce Our proposal Master Job Launch a MapReduce job for each iteration until the algorithm convergence (until stabilization of representative examples). MapReduce 1 Mapper function: compute for each example the closer representative example, and assign it to the associated cluster. 2 Reducer function: collect for each couple of (cluster, example) the partial sum of the computed distances from all mappers, and then re-compute the k new representative examples list.
26 K-means implementation Apriori implementation on MapReduce Our proposal Apriori implementation on MapReduce -1 Map void map(void* map data) for each transaction in map data for (i = 0; i < candidates size; i++) match = false ; itemset = candidates[i] match = itemset exists(transaction, itemset) if (match == true ) emit intermediate(itemset, one) Reduce void reduce(void* key, void** vals, int vals length) count = 0 for (i = 0; i < vals length; i++) count+ = *vals[i] if ( count support level * num transactions)/100.0 emit(key, count)
27 K-means implementation Apriori implementation on MapReduce Our proposal Apriori implementation on MapReduce-2 Mise jour de la liste des candidats void update frequent candidates(void * reduce data out) j = 0 length = reduce data out length for (j = 0; i < length; j++) temp candidates[j++] = reduce data out key candidates = temp candidates
28 K-means implementation Apriori implementation on MapReduce Our proposal Parallel implementation of our algorithm Apply the above parallel version of the the k-medoids algorithm in order to obtain the the data set segmented into k clusters. Initialize the list of candidates to the k representative exemples. Re-distribute the k obtained clusters on k mappers, Repeat 1 Send the list of candidates to the k mappers. 2 Each mapper computes the local support of each candidate. 3 The reducer collects all local supports for each candidate and compute for some candidates the global supports if necessary. 4 The master updates accepted, excluded and candidates lists. Until the list of candidates is empty.
29 Conclusion A new algorithm for searching frequent itemsets in large data bases. The idea is to start searching from a set of representative examples instead of testing the 1-itemset, and so on. The k-medoids clustering algorithm is firstly applied in order to cluster the transactions into k clusters. Each cluster is represented by the most representative example. Experimental results show that beginning the search of frequent itemsets from these representative examples leads to find a significant number of them locally.
30 Perspectives Update the algorithm in order to find all frequent itemsets. We have proposed parallel version of this algorithm based on the MapReduce Framework: We are now implementing this parallel version and comparing performances with other parallel implementation like the massively parallel FP-Growth algorithm.
31 Bibliographie I Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages , Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In NIPS, pages , Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages , Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2):1 12, May Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. A map-reduce system with an alternate api for multi-core environments. In CCGRID, pages 84 93, L. Kaufman and P.J. Rousseeuw. Clustering by means of medoids. In Y. Dodge, editor, Statistical Data Analysis Based on the Norm and Related Methods, pages 405,416. North-Holland, 1987.
32 Bibliographie II Walter A. Kosters and Wim Pijls. Apriori, a depth first implementation. In Proc. of the Workshop on Frequent Itemset Mining Implementations, Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang. Pfp: parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, RecSys 08, pages , New York, NY, USA, ACM. Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning Publications, 1 edition, January Mingjun Song and Sanguthevar Rajasekaran. A transaction mapping algorithm for frequent itemsets mining. IEEE Transactions on Knowledge and Data Engineering, 18: , Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on mapreduce. In CloudCom, pages , 2009.
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
Static Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,
Binary Coded Web Access Pattern Tree in Education Domain
Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: [email protected] M. Moorthi
Distributed Apriori in Hadoop MapReduce Framework
Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
Distributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE
DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE SK MD OBAIDULLAH Department of Computer Science & Engineering, Aliah University, Saltlake, Sector-V, Kol-900091, West Bengal, India [email protected]
Optimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
CIS4930/6930 Data Science: Large-scale Advanced Data Analysis Fall 2011. Daisy Zhe Wang CISE Department University of Florida
CIS4930/6930 Data Science: Large-scale Advanced Data Analysis Fall 2011 Daisy Zhe Wang CISE Department University of Florida 1 Vital Information Instructor: Daisy Zhe Wang Office: E456 Class time: Tuesdays
Mining Association Rules: A Database Perspective
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 69 Mining Association Rules: A Database Perspective Dr. Abdallah Alashqur Faculty of Information Technology
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster
Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster Sudhakar Singh Dept. of Computer Science Faculty of Science Banaras Hindu University Rakhi Garg Dept. of Computer
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Frequent Itemset Mining for Big Data
Frequent Itemset Mining for Big Data Sandy Moens, Emin Aksehirli and Bart Goethals Universiteit Antwerpen, Belgium Email: [email protected] Abstract Frequent Itemset Mining (FIM) is one
Improving Apriori Algorithm to get better performance with Cloud Computing
Improving Apriori Algorithm to get better performance with Cloud Computing Zeba Qureshi 1 ; Sanjay Bansal 2 Affiliation: A.I.T.R, RGPV, India 1, A.I.T.R, RGPV, India 2 ABSTRACT Cloud computing has become
Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm
R. Sridevi et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm R. Sridevi,*
Radoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
Fuzzy Logic -based Pre-processing for Fuzzy Association Rule Mining
Fuzzy Logic -based Pre-processing for Fuzzy Association Rule Mining by Ashish Mangalampalli, Vikram Pudi Report No: IIIT/TR/2008/127 Centre for Data Engineering International Institute of Information Technology
Directed Graph based Distributed Sequential Pattern Mining Using Hadoop Map Reduce
Directed Graph based Distributed Sequential Pattern Mining Using Hadoop Map Reduce Sushila S. Shelke, Suhasini A. Itkar, PES s Modern College of Engineering, Shivajinagar, Pune Abstract - Usual sequential
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
Hadoop Design and k-means Clustering
Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise
SPMF: a Java Open-Source Pattern Mining Library
Journal of Machine Learning Research 1 (2014) 1-5 Submitted 4/12; Published 10/14 SPMF: a Java Open-Source Pattern Mining Library Philippe Fournier-Viger [email protected] Department
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Efficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH
MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH M.Rajalakshmi 1, Dr.T.Purusothaman 2, Dr.R.Nedunchezhian 3 1 Assistant Professor (SG), Coimbatore Institute of Technology, India, [email protected]
Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. VLDB 2009 CS 422 Decision Trees: Main Components Find Best Split Choose split
RDB-MINER: A SQL-Based Algorithm for Mining True Relational Databases
998 JOURNAL OF SOFTWARE, VOL. 5, NO. 9, SEPTEMBER 2010 RDB-MINER: A SQL-Based Algorithm for Mining True Relational Databases Abdallah Alashqur Faculty of Information Technology Applied Science University
Mining Interesting Medical Knowledge from Big Data
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
Apriori-Map/Reduce Algorithm
Apriori-Map/Reduce Algorithm Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce algorithm has received highlights as cloud computing services
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn
MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
Top 10 Algorithms in Data Mining
Top 10 Algorithms in Data Mining Xindong Wu ( 吴 信 东 ) Department of Computer Science University of Vermont, USA; 合 肥 工 业 大 学 计 算 机 与 信 息 学 院 1 Top 10 Algorithms in Data Mining by the IEEE ICDM Conference
Top Top 10 Algorithms in Data Mining
ICDM 06 Panel on Top Top 10 Algorithms in Data Mining 1. The 3-step identification process 2. The 18 identified candidates 3. Algorithm presentations 4. Top 10 algorithms: summary 5. Open discussions ICDM
A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems
A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems Ismail Hababeh School of Computer Engineering and Information Technology, German-Jordanian University Amman, Jordan Abstract-
Advances in Natural and Applied Sciences
AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and
MapReduce/Bigtable for Distributed Optimization
MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. [email protected] Scott Gilpin Google Inc. [email protected] Gideon Mann Google Inc. [email protected] Abstract With large data
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
Association Rule Mining
Association Rule Mining Association Rules and Frequent Patterns Frequent Pattern Mining Algorithms Apriori FP-growth Correlation Analysis Constraint-based Mining Using Frequent Patterns for Classification
A hybrid algorithm combining weighted and hasht apriori algorithms in Map Reduce model using Eucalyptus cloud platform
A hybrid algorithm combining weighted and hasht apriori algorithms in Map Reduce model using Eucalyptus cloud platform 1 R. SUMITHRA, 2 SUJNI PAUL AND 3 D. PONMARY PUSHPA LATHA 1 School of Computer Science,
A Time Efficient Algorithm for Web Log Analysis
A Time Efficient Algorithm for Web Log Analysis Santosh Shakya Anju Singh Divakar Singh Student [M.Tech.6 th sem (CSE)] Asst.Proff, Dept. of CSE BU HOD (CSE), BUIT, BUIT,BU Bhopal Barkatullah University,
Study of Data Mining algorithm in cloud computing using MapReduce Framework
Study of Data Mining algorithm in cloud computing using MapReduce Framework Viki Patil, M.Tech, Department of Computer Engineering and Information Technology, V.J.T.I., Mumbai Prof. V. B. Nikam, Professor,
How To Write A Paper On Bloom Join On A Distributed Database
Research Paper BLOOM JOIN FINE-TUNES DISTRIBUTED QUERY IN HADOOP ENVIRONMENT Dr. Sunita M. Mahajan 1 and Ms. Vaishali P. Jadhav 2 Address for Correspondence 1 Principal, Mumbai Education Trust, Bandra,
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA
IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA Jayalatchumy D 1, Thambidurai. P 2 Abstract Clustering is a process of grouping objects that are similar among themselves but dissimilar
Web Log Based Analysis of User s Browsing Behavior
Web Log Based Analysis of User s Browsing Behavior Ashwini Ladekar 1, Dhanashree Raikar 2,Pooja Pawar 3 B.E Student, Department of Computer, JSPM s BSIOTR, Wagholi,Pune, India 1 B.E Student, Department
An Efficient Frequent Item Mining using Various Hybrid Data Mining Techniques in Super Market Dataset
An Efficient Frequent Item Mining using Various Hybrid Data Mining Techniques in Super Market Dataset P.Abinaya 1, Dr. (Mrs) D.Suganyadevi 2 M.Phil. Scholar 1, Department of Computer Science,STC,Pollachi
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
A Breadth-First Algorithm for Mining Frequent Patterns from Event Logs
A Breadth-First Algorithm for Mining Frequent Patterns from Event Logs Risto Vaarandi Department of Computer Engineering, Tallinn University of Technology, Estonia [email protected] Abstract. Today,
The Fuzzy Frequent Pattern Tree
The Fuzzy Frequent Pattern Tree STERGIOS PAPADIMITRIOU 1 SEFERINA MAVROUDI 2 1. Department of Information Management, Technological Educational Institute of Kavala, 65404 Kavala, Greece 2. Pattern Recognition
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains
A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains Dr. Kanak Saxena Professor & Head, Computer Application SATI, Vidisha, [email protected] D.S. Rajpoot Registrar,
Large Scale Multi-view Learning on MapReduce
Large Scale Multi-view Learning on MapReduce Cibe Hariharan Ericsson Research India Email: [email protected] Shivashankar Subramanian Ericsson Research India Email: [email protected] Abstract
Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. [email protected] Mrs.
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Introduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
Association Rule Mining using Apriori Algorithm for Distributed System: a Survey
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VIII (Mar-Apr. 2014), PP 112-118 Association Rule Mining using Apriori Algorithm for Distributed
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace
Mining an Online Auctions Data Warehouse
Proceedings of MASPLAS'02 The Mid-Atlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance
A computational model for MapReduce job flow
A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125
Comparison of Data Mining Techniques for Money Laundering Detection System
Comparison of Data Mining Techniques for Money Laundering Detection System Rafał Dreżewski, Grzegorz Dziuban, Łukasz Hernik, Michał Pączek AGH University of Science and Technology, Department of Computer
SCHEDULING IN CLOUD COMPUTING
SCHEDULING IN CLOUD COMPUTING Lipsa Tripathy, Rasmi Ranjan Patra CSA,CPGS,OUAT,Bhubaneswar,Odisha Abstract Cloud computing is an emerging technology. It process huge amount of data so scheduling mechanism
ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL
International Journal Of Advanced Technology In Engineering And Science Www.Ijates.Com Volume No 03, Special Issue No. 01, February 2015 ISSN (Online): 2348 7550 ASSOCIATION RULE MINING ON WEB LOGS FOR
Image Search by MapReduce
Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s
Mining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
Application of Data Mining Techniques For Diabetic DataSet
Computing For Nation Development, February 25 26, 2010 Bharati Vidyapeeth s Institute of Computer Applications and Management, New Delhi Application of Data Mining Techniques For DataSet 1 Runumi Devi
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
Hadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
Load-Balancing the Distance Computations in Record Linkage
Load-Balancing the Distance Computations in Record Linkage Dimitrios Karapiperis Vassilios S. Verykios Hellenic Open University School of Science and Technology Patras, Greece {dkarapiperis, verykios}@eap.gr
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
Research on Job Scheduling Algorithm in Hadoop
Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of
