# Statistical Learning Theory Meets Big Data

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University

2 Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician, (attributed) Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009

3 Big Data Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009

4 Big Data Big data requires big storage and long execution time It's great to have big data but do we need to process all of it? Trade off accuracy of results for execution speed Can approximation algorithms for data analysis be fast and still guarantee high-quality results?

5 Frequent Itemsets Market Basket Analysis You own a grocery store Have copy of receipts from sales For each customer, you have the list of products she bought Interested in what groups of products are bought together correlated items Business decisions Optimization

6 Settings Transactional dataset D Figure from Tan et al. - Introduction to Data Mining

7 Transactions Transactional dataset D transaction Figure from Tan et al. - Introduction to Data Mining

8 Items Transactions are built on items from a domain items Figure from Tan et al. - Introduction to Data Mining

9 Itemsets Sets of items are called itemsets Itemsets Figure from Tan et al. - Introduction to Data Mining

10 Frequency Frequency of an itemset X in dataset D: : fraction of transactions of D containing X Example: Milk: 4/5, {Bread, Milk}: 3/5 Figure from Tan et al. - Introduction to Data Mining

11 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency

12 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency Top -K Frequent Itemsets Find all itemsets at least as frequent as the kth most frequent Association Rules X in a transaction is correlated with Y in the transaction

13 Frequent Itemsets mining Classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]

14 Frequent Itemsets mining Well studied classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00] Running time depends on number of transactions in the dataset and on the number of frequent itemsets 10^8 transactions considered normal size 10^4 items 2^(10^4) itemsets total A transaction with d items contains 2^d itemsets

15 How to speed up mining? Decrease dependency on number of transactions:

16 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower

17 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable

18 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1

19 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must be in W

20 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W Must be in W

21 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W May be in W Must be in W

22 Observa(on Data mining is exploratory in nature Exact results are usually not needed We want a good, realistic idea of the process Fast analysis is more important... if it still guarantees good-enough results! 22 Riondato et al., PARMA, CIKM'12

23 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of itemset from sample, such that: Frequency 0 1 Must not be in W May be in W Must be in W Problem: choose right sample size and right

24 Choosing If we have, for all itemsets X simultaneously then does the trick

25 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W

26 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W Need sample size S s.t., with prob. at least, for all itemsets X, we have

27 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that

28 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that Problem: There's an exponential number of itemsets Sample size would too large (linear the in number of elements)

29 Vapnik-Chervonenkis Dimension Combinatorial property of a collection of subsets from a domain Measures the richness or expressivity of the collection of subsets If we know the VC-dim of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample

30 Range spaces VC-Dimension is defined on range spaces (B,R): range space B: domain R: collection of subsets from B (ranges) No restrictions: B can be infinite R can be infinite R can contain infinitely-large subsets of B

31 Vapnik-Chervonenkis Dimension Range space For any, define C is shattered if The VC-Dimension of (B,R) is the size of the largest shattered subset of B

32 Example of VC-Dimension y B = x

33 Example of VC-Dimension y B = R = all axis-aligned rectangles in x

34 Example of VC-Dimension y B = R = all axis-aligned rectangles in x

35 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned

36 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y x

37 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y Need 16 rectangles x

38 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points?

39 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points: impossible

40 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points: impossible Take any 5 points One of them that is contained in all rectangles containing the other four Impossible to find a rectangle containing only the other four VC(B,R)=4

41 Approximating sizes of ranges Sample Theorem: Let (B,R) have VC(B,R) d. Given, let S be a collection of points from B sampled uniformly at random. If then, Can approximate sizes of all simultaneously No need of union bound

42 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions)

43 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

44 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

45 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

46 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

47 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X To approximate using the sample theorem, need upper bound to VC(B,R)

48 Upper Bound to VC-Dim for FI's B = dataset D (set of transactions) Itemset X, = transactions of D containing X Theorem: [The d-index of a dataset] Let d be the maximum integer such that D contains at least d transactions of length at least d. Then VC(B,R) d

49 The d- index of the dataset d-index d: the maximum integer such that D contains at least d transactions of length at least d Example: this dataset has d-index d=3 bread beer milk coffee chips coke pasta bread coke chips milk coffee pasta milk Independent of D Can be computed with a single scan of the dataset d is an upper bound to the VC-dimension 49

50 Choosing the right sample size Theorem Let Let D be a dataset and d be the max integer such that D contains at least d transactions of length at least d Let S be a collection of transactions of D sampled independently and uniformly at random, with Then

51 Algorithm To extract approximate collection of freq itemsets: 1.Compute d for the dataset D 2.Compute sample size (function of d) 3.Create random sample S from D 4.Extract Freq Itemsets from S using

52 Algorithm To extract approximate collection of Freq. Itemsets: 1.Compute d for the dataset D 2.Compute sample size given d 3.Create random sample S from D 4.Extract Frequent Itemsets from S using Theorem: With probability at least, the returned set W of itemsets satisfies the desired property Frequency 0 1 Must not be in W May be in W Must be in W

53 A closer look... Expression of sample size:

54 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d

55 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets

56 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets The time to mine S does not depend on D!!!

57 Proof (Intuition) For a set of k transactions to be shattered, each transaction must appear in 2^(k-1) different 's where X is an itemset A transaction only appears in the 's of the itemsets X it contains A transaction of length w contains 2^(w)-1 itemsets Need w k for the transaction to belong to a shattered set of size k To shatter k transactions they must all have length k Max k for which it happens is upper bound to VC-Dim

58 Experiments Sample always fits into main memory Output always satisfies required approx. guarantees Frequency accuracy even better than guaranteed Mining time significantly improved

59 Conclusions We showed how VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through samp Riondato & Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML/PKDD (1)

60 PARMA: A Randomized Algorithm for Approximate Associa;on Rules Mining in MapReduce Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal

61 Goals Extract high quality approximation of FI(D,θ) by mining several small samples of D in parallel Adapt to available computational resources Minimize data replication and communication Achieve high parallelism 61 Riondato et al., PARMA, CIKM'12

62 Why parallelizing? For a very high quality approximation, the sample would still be large Sample creation time depends on size of dataset Parallelization allows us to achieve: higher accuracy (lower ε) higher confidence (lower δ) faster results Makes sense when dataset is stored on a distributed filesystem 62 Riondato et al., PARMA, CIKM'12

63 MapReduce framework for easy parallelization on large clusters of commodity machines Allows very fast analysis of large datasets Algorithms expressed through two functions: Map: (k,v) (k1,v1),(k2,v2),... Reduce: (r,(w1,w2,...)) (k1,v1),(k2,v2),... It is expensive to move data around (shuffling) Design goal: avoid data replication 63 Riondato et al., PARMA, CIKM'12

64 The op'miza'on framework Sample should fit into local memory No more samples than number of processors Exploit available computational resources to maximize quality of approximation Constraints and Objective function expressed in a Mixed Integer Non Linear Program Convex feasibility region and convex obj. function Easy and fast to solve with global MINLP solver 64 Riondato et al., PARMA, CIKM'12

65 The Algorithm: PARMA Very simple design 2 MapReduce Rounds 1 st round Map: create samples Reduce: mine each sample 2 nd round Map: identity function Reduce: aggregation/filtering of results 65 Riondato et al., PARMA, CIKM'12

66 1 st Round: Sampling and Mining Parallelization of the sequential algorithm Map: Sample Creation Random sampling with replacement... in parallel! Input: (transaction id, transaction) Output: (sample id, transaction) Reduce: Mining of each sample in parallel using lowered frequency threshold θ-ε/2 Can use any sequential exact mining algorithm (APriori, FPGrowth,...) Input: (sample id, (transaction1, transaction2,...)) Output: (itemset, (frequency, confidence interval)) 66 Riondato et al., PARMA, CIKM'12

67 2 nd Phase: Aggrega:on Map: Identity function Reduce: Filtering and Aggregation Input: (itemset,((freq1,conf_int1),(freq2,conf_int2),...) Output: (itemset, (frequency, confidence interval)) An itemset is sent in output only if it was frequent in a sufficient number of samples (determined by the optimization framework) Confidence interval: smallest interval containing a sufficient number of estimations freq1,freq2,... Frequency: median of confidence interval 67 Riondato et al., PARMA, CIKM'12

68 Analysis Theorem: With probability at least 1-δ, output is an ε-approximation to FI(D,θ) Proof intuition: Boosting-like approach Outputs of the reducers in stage 1 are ε-approximations for the local samples, with some probability 1-φ If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset Consequences: better approximation, fewer false positives 68 Riondato et al., PARMA, CIKM'12

69 Implementa:on PARMA implemented as Java library for Hadoop Easy to run, possible integration with Apache Mahout FP-Growth used for the sample mining phase PARMA is agnostic to the choice of mining algorithm We used an existing Java implementation 1400 lines of code Available on github 69 Riondato et al., PARMA, CIKM'12

70 Experiments Run on Amazon Elastic MapReduce Up to 16 machines, with 17GB of RAM Artificial datasets using standard benchmark generators Various parameters (transaction size, items,...) Size between 5 to 50 million transactions Compared performances of PARMA with PFP and naive distributed counting PARMA performs significantly better (1.5x to 5x speed improvement) in all tests 70 Riondato et al., PARMA, CIKM'12

71 Run-me Analysis Runtime does not grow with size of data Because individual sample size does not grow! Much faster than PFP, faster as data grows! PARMA gets faster with more machines, even with more data. 71 Riondato et al., PARMA, CIKM'12

72 Speedup Quasi linear speedup Stage 1 very close to optimal Stage 2 no speedup Running time depends on number of itemsets No way to control the number of itemsets Common in frequent ideal itemsets mining algorithms 7 total Stage 1 Stage 2 speedup Riondato et al., PARMA, CIKM' number of instances 72

73 Accuracy Output always ε-approximation to FI(D,θ) Output not much larger than real collection Better frequency accuracy than guaranteed Smaller confidence intervals than guaranteed Riondato et al., PARMA, CIKM'12

74 Conclusions We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules from a large dataset with minimum data replication and quasi-linear speedup 74 Riondato et al., PARMA, CIKM'12

75 Publications Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012 Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012

### PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce

PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce Matteo Riondato Dept. of Computer Science Brown University Providence, RI, USA matteo@cs.brown.edu Justin A.

### MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

### MATTEO RIONDATO Curriculum vitae

MATTEO RIONDATO Curriculum vitae 100 Avenue of the Americas, 16 th Fl. New York, NY 10013, USA +1 646 292 6641 riondato@acm.org http://matteo.rionda.to EDUCATION Ph.D. Computer Science, Brown University,

### Analytics on Big Data

Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

### Abstract of Sampling-based Randomized Algorithms for Big Data Analytics by Matteo Riondato,

Abstract of Sampling-based Randomized Algorithms for Big Data Analytics by Matteo Riondato, Ph.D., Brown University, May 2014. Analyzing huge datasets becomes prohibitively slow when the dataset does not

### Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster Sudhakar Singh Dept. of Computer Science Faculty of Science Banaras Hindu University Rakhi Garg Dept. of Computer

### Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

### Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm Purpose: key concepts in mining frequent itemsets understand the Apriori algorithm run Apriori in Weka GUI and in programatic way 1 Theoretical

### PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

International Journal of Computer Science and Applications, Vol. 5, No. 4, pp 57-69, 2008 Technomathematics Research Foundation PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

### B490 Mining the Big Data. 0 Introduction

B490 Mining the Big Data 0 Introduction Qin Zhang 1-1 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 2-1 Data Mining What is Data Mining? A

### CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

### Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

### Association Rules Apriori Algorithm. Machine Learning Overview Sales Transaction and Association Rules Aprori Algorithm Example

Association Rules Apriori Algorithm Machine Learning Overview Sales Transaction and Association Rules Aprori Algorithm Example 1 Machine Learning Common ground of presented methods Statistical Learning

### Clustering Algorithms. Data Mining Clustering. Distance. Example. More Than One Mean. Mean Clustering

Clustering Algorithms Data Mining Clustering Kevin Swingler Organise data into a number of distinct groups (clusters) according to the similarity of their members and their differences from other clusters

### Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

### Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus

### Apriori-Map/Reduce Algorithm

Apriori-Map/Reduce Algorithm Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce algorithm has received highlights as cloud computing services

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

### Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20

Interactive Analytical Processing in Big Data Systems,BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking,Study about DataSet May 23, 2014 Interactive Analytical Processing in Big Data Systems,BDGS:

### Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

### Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

### A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

### Using Data Mining and Machine Learning in Retail

Using Data Mining and Machine Learning in Retail Omeid Seide Senior Manager, Big Data Solutions Sears Holdings Bharat Prasad Big Data Solution Architect Sears Holdings Over a Century of Innovation A Fortune

### Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

### Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

### Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key

### KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it

KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

### R at the front end and

Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation

### Joint Optimization of Overlapping Phases in MapReduce

Joint Optimization of Overlapping Phases in MapReduce Minghong Lin, Li Zhang, Adam Wierman, Jian Tan Abstract MapReduce is a scalable parallel computing framework for big data processing. It exhibits multiple

### Management & Analysis of Big Data in Zenith Team

Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM Outline Introduction to MapReduce Dealing with Data Skew in Big Data Processing Data Partitioning for MapReduce Frequent Sequence

### Data Mining Applications in Manufacturing

Data Mining Applications in Manufacturing Dr Jenny Harding Senior Lecturer Wolfson School of Mechanical & Manufacturing Engineering, Loughborough University Identification of Knowledge - Context Intelligent

### Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

### Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

### MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan

### Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.

### Mining an Online Auctions Data Warehouse

Proceedings of MASPLAS'02 The Mid-Atlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance

### Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

### Case study: d60 Raptor smartadvisor. Jan Neerbek Alexandra Institute

Case study: d60 Raptor smartadvisor Jan Neerbek Alexandra Institute Agenda d60: A cloud/data mining case Cloud Data Mining Market Basket Analysis Large data sets Our solution 2 Alexandra Institute The

### Developing MapReduce Programs

Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

### Data Warehousing und Data Mining

Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data

### A Novel Cloud Based Elastic Framework for Big Data Preprocessing

School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

### IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications

Open System Laboratory of University of Illinois at Urbana Champaign presents: Outline: IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications A Fine-Grained Adaptive

### Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

### Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

### A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

### Improving Apriori Algorithm to get better performance with Cloud Computing

Improving Apriori Algorithm to get better performance with Cloud Computing Zeba Qureshi 1 ; Sanjay Bansal 2 Affiliation: A.I.T.R, RGPV, India 1, A.I.T.R, RGPV, India 2 ABSTRACT Cloud computing has become

### Mining Interesting Medical Knowledge from Big Data

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

### Large-Scale Test Mining

Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

### Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will

### Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

### Chapter 20: Data Analysis

Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

### Project Report. 1. Application Scenario

Project Report In this report, we briefly introduce the application scenario of association rule mining, give details of apriori algorithm implementation and comment on the mined rules. Also some instructions

Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer

### GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

### CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

### Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

### Exploring HADOOP as a Platform for Distributed Association Rule Mining

FUTURE COMPUTING 2013 : The Fifth International Conference on Future Computational Technologies and Applications Exploring HADOOP as a Platform for Distributed Association Rule Mining Shravanth Oruganti,

### Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

### Frequent Itemset Mining for Big Data

Frequent Itemset Mining for Big Data Sandy Moens, Emin Aksehirli and Bart Goethals Universiteit Antwerpen, Belgium Email: firstname.lastname@uantwerpen.be Abstract Frequent Itemset Mining (FIM) is one

Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

### BIG DATA What it is and how to use?

BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

### New Matrix Approach to Improve Apriori Algorithm

New Matrix Approach to Improve Apriori Algorithm A. Rehab H. Alwa, B. Anasuya V Patil Associate Prof., IT Faculty, Majan College-University College Muscat, Oman, rehab.alwan@majancolleg.edu.om Associate

### A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

### A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

### The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and

### DWMiner : A tool for mining frequent item sets efficiently in data warehouses

DWMiner : A tool for mining frequent item sets efficiently in data warehouses Bruno Kinder Almentero, Alexandre Gonçalves Evsukoff and Marta Mattoso COPPE/Federal University of Rio de Janeiro, P.O.Box

### Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

### On the Cost of Mining Very Large Open Source Repositories

On the Cost of Mining Very Large Open Source Repositories Sean Banerjee Carnegie Mellon University Bojan Cukic University of North Carolina at Charlotte BIGDSE, Florence 2015 Introduction Issue tracking

### MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational

Simply Mining Data Jilles Vreeken So, how do you pronounce Exploratory Data Analysis Jilles Vreeken Jilles Yill less Vreeken Fray can 17 August 2015 Okay, now we can talk. 17 August 2015 The goal So, what

### Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

### CAS CS 565, Data Mining

CAS CS 565, Data Mining Course logistics Course webpage: http://www.cs.bu.edu/~evimaria/cs565-10.html Schedule: Mon Wed, 4-5:30 Instructor: Evimaria Terzi, evimaria@cs.bu.edu Office hours: Mon 2:30-4pm,

### Small Maximal Independent Sets and Faster Exact Graph Coloring

Small Maximal Independent Sets and Faster Exact Graph Coloring David Eppstein Univ. of California, Irvine Dept. of Information and Computer Science The Exact Graph Coloring Problem: Given an undirected

### Energy Efficient MapReduce

Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

### Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

### Statistics for BIG data

Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### Using In-Memory Computing to Simplify Big Data Analytics

SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

### Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

### Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

### Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce approach has been

### IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA Jayalatchumy D 1, Thambidurai. P 2 Abstract Clustering is a process of grouping objects that are similar among themselves but dissimilar

### Locality-Sensitive Operators for Parallel Main-Memory Database Clusters

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München,

### Developing a MapReduce Application

TIE 12206 - Apache Hadoop Tampere University of Technology, Finland November, 2014 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 MapReduce

### Big Data Explained. An introduction to Big Data Science.

Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail: