Statistical Learning Theory Meets Big Data

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Statistical Learning Theory Meets Big Data"

Transcription

1 Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University

2 Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician, (attributed) Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009

3 Big Data Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009

4 Big Data Big data requires big storage and long execution time It's great to have big data but do we need to process all of it? Trade off accuracy of results for execution speed Can approximation algorithms for data analysis be fast and still guarantee high-quality results?

5 Frequent Itemsets Market Basket Analysis You own a grocery store Have copy of receipts from sales For each customer, you have the list of products she bought Interested in what groups of products are bought together correlated items Business decisions Optimization

6 Settings Transactional dataset D Figure from Tan et al. - Introduction to Data Mining

7 Transactions Transactional dataset D transaction Figure from Tan et al. - Introduction to Data Mining

8 Items Transactions are built on items from a domain items Figure from Tan et al. - Introduction to Data Mining

9 Itemsets Sets of items are called itemsets Itemsets Figure from Tan et al. - Introduction to Data Mining

10 Frequency Frequency of an itemset X in dataset D: : fraction of transactions of D containing X Example: Milk: 4/5, {Bread, Milk}: 3/5 Figure from Tan et al. - Introduction to Data Mining

11 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency

12 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency Top -K Frequent Itemsets Find all itemsets at least as frequent as the kth most frequent Association Rules X in a transaction is correlated with Y in the transaction

13 Frequent Itemsets mining Classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]

14 Frequent Itemsets mining Well studied classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00] Running time depends on number of transactions in the dataset and on the number of frequent itemsets 10^8 transactions considered normal size 10^4 items 2^(10^4) itemsets total A transaction with d items contains 2^d itemsets

15 How to speed up mining? Decrease dependency on number of transactions:

16 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower

17 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable

18 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1

19 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must be in W

20 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W Must be in W

21 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W May be in W Must be in W

22 Observa(on Data mining is exploratory in nature Exact results are usually not needed We want a good, realistic idea of the process Fast analysis is more important... if it still guarantees good-enough results! 22 Riondato et al., PARMA, CIKM'12

23 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of itemset from sample, such that: Frequency 0 1 Must not be in W May be in W Must be in W Problem: choose right sample size and right

24 Choosing If we have, for all itemsets X simultaneously then does the trick

25 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W

26 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W Need sample size S s.t., with prob. at least, for all itemsets X, we have

27 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that

28 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that Problem: There's an exponential number of itemsets Sample size would too large (linear the in number of elements)

29 Vapnik-Chervonenkis Dimension Combinatorial property of a collection of subsets from a domain Measures the richness or expressivity of the collection of subsets If we know the VC-dim of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample

30 Range spaces VC-Dimension is defined on range spaces (B,R): range space B: domain R: collection of subsets from B (ranges) No restrictions: B can be infinite R can be infinite R can contain infinitely-large subsets of B

31 Vapnik-Chervonenkis Dimension Range space For any, define C is shattered if The VC-Dimension of (B,R) is the size of the largest shattered subset of B

32 Example of VC-Dimension y B = x

33 Example of VC-Dimension y B = R = all axis-aligned rectangles in x

34 Example of VC-Dimension y B = R = all axis-aligned rectangles in x

35 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned

36 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y x

37 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y Need 16 rectangles x

38 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points?

39 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points: impossible

40 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points: impossible Take any 5 points One of them that is contained in all rectangles containing the other four Impossible to find a rectangle containing only the other four VC(B,R)=4

41 Approximating sizes of ranges Sample Theorem: Let (B,R) have VC(B,R) d. Given, let S be a collection of points from B sampled uniformly at random. If then, Can approximate sizes of all simultaneously No need of union bound

42 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions)

43 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

44 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

45 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

46 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

47 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X To approximate using the sample theorem, need upper bound to VC(B,R)

48 Upper Bound to VC-Dim for FI's B = dataset D (set of transactions) Itemset X, = transactions of D containing X Theorem: [The d-index of a dataset] Let d be the maximum integer such that D contains at least d transactions of length at least d. Then VC(B,R) d

49 The d- index of the dataset d-index d: the maximum integer such that D contains at least d transactions of length at least d Example: this dataset has d-index d=3 bread beer milk coffee chips coke pasta bread coke chips milk coffee pasta milk Independent of D Can be computed with a single scan of the dataset d is an upper bound to the VC-dimension 49

50 Choosing the right sample size Theorem Let Let D be a dataset and d be the max integer such that D contains at least d transactions of length at least d Let S be a collection of transactions of D sampled independently and uniformly at random, with Then

51 Algorithm To extract approximate collection of freq itemsets: 1.Compute d for the dataset D 2.Compute sample size (function of d) 3.Create random sample S from D 4.Extract Freq Itemsets from S using

52 Algorithm To extract approximate collection of Freq. Itemsets: 1.Compute d for the dataset D 2.Compute sample size given d 3.Create random sample S from D 4.Extract Frequent Itemsets from S using Theorem: With probability at least, the returned set W of itemsets satisfies the desired property Frequency 0 1 Must not be in W May be in W Must be in W

53 A closer look... Expression of sample size:

54 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d

55 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets

56 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets The time to mine S does not depend on D!!!

57 Proof (Intuition) For a set of k transactions to be shattered, each transaction must appear in 2^(k-1) different 's where X is an itemset A transaction only appears in the 's of the itemsets X it contains A transaction of length w contains 2^(w)-1 itemsets Need w k for the transaction to belong to a shattered set of size k To shatter k transactions they must all have length k Max k for which it happens is upper bound to VC-Dim

58 Experiments Sample always fits into main memory Output always satisfies required approx. guarantees Frequency accuracy even better than guaranteed Mining time significantly improved

59 Conclusions We showed how VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through samp Riondato & Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML/PKDD (1)

60 PARMA: A Randomized Algorithm for Approximate Associa;on Rules Mining in MapReduce Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal

61 Goals Extract high quality approximation of FI(D,θ) by mining several small samples of D in parallel Adapt to available computational resources Minimize data replication and communication Achieve high parallelism 61 Riondato et al., PARMA, CIKM'12

62 Why parallelizing? For a very high quality approximation, the sample would still be large Sample creation time depends on size of dataset Parallelization allows us to achieve: higher accuracy (lower ε) higher confidence (lower δ) faster results Makes sense when dataset is stored on a distributed filesystem 62 Riondato et al., PARMA, CIKM'12

63 MapReduce framework for easy parallelization on large clusters of commodity machines Allows very fast analysis of large datasets Algorithms expressed through two functions: Map: (k,v) (k1,v1),(k2,v2),... Reduce: (r,(w1,w2,...)) (k1,v1),(k2,v2),... It is expensive to move data around (shuffling) Design goal: avoid data replication 63 Riondato et al., PARMA, CIKM'12

64 The op'miza'on framework Sample should fit into local memory No more samples than number of processors Exploit available computational resources to maximize quality of approximation Constraints and Objective function expressed in a Mixed Integer Non Linear Program Convex feasibility region and convex obj. function Easy and fast to solve with global MINLP solver 64 Riondato et al., PARMA, CIKM'12

65 The Algorithm: PARMA Very simple design 2 MapReduce Rounds 1 st round Map: create samples Reduce: mine each sample 2 nd round Map: identity function Reduce: aggregation/filtering of results 65 Riondato et al., PARMA, CIKM'12

66 1 st Round: Sampling and Mining Parallelization of the sequential algorithm Map: Sample Creation Random sampling with replacement... in parallel! Input: (transaction id, transaction) Output: (sample id, transaction) Reduce: Mining of each sample in parallel using lowered frequency threshold θ-ε/2 Can use any sequential exact mining algorithm (APriori, FPGrowth,...) Input: (sample id, (transaction1, transaction2,...)) Output: (itemset, (frequency, confidence interval)) 66 Riondato et al., PARMA, CIKM'12

67 2 nd Phase: Aggrega:on Map: Identity function Reduce: Filtering and Aggregation Input: (itemset,((freq1,conf_int1),(freq2,conf_int2),...) Output: (itemset, (frequency, confidence interval)) An itemset is sent in output only if it was frequent in a sufficient number of samples (determined by the optimization framework) Confidence interval: smallest interval containing a sufficient number of estimations freq1,freq2,... Frequency: median of confidence interval 67 Riondato et al., PARMA, CIKM'12

68 Analysis Theorem: With probability at least 1-δ, output is an ε-approximation to FI(D,θ) Proof intuition: Boosting-like approach Outputs of the reducers in stage 1 are ε-approximations for the local samples, with some probability 1-φ If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset Consequences: better approximation, fewer false positives 68 Riondato et al., PARMA, CIKM'12

69 Implementa:on PARMA implemented as Java library for Hadoop Easy to run, possible integration with Apache Mahout FP-Growth used for the sample mining phase PARMA is agnostic to the choice of mining algorithm We used an existing Java implementation 1400 lines of code Available on github 69 Riondato et al., PARMA, CIKM'12

70 Experiments Run on Amazon Elastic MapReduce Up to 16 machines, with 17GB of RAM Artificial datasets using standard benchmark generators Various parameters (transaction size, items,...) Size between 5 to 50 million transactions Compared performances of PARMA with PFP and naive distributed counting PARMA performs significantly better (1.5x to 5x speed improvement) in all tests 70 Riondato et al., PARMA, CIKM'12

71 Run-me Analysis Runtime does not grow with size of data Because individual sample size does not grow! Much faster than PFP, faster as data grows! PARMA gets faster with more machines, even with more data. 71 Riondato et al., PARMA, CIKM'12

72 Speedup Quasi linear speedup Stage 1 very close to optimal Stage 2 no speedup Running time depends on number of itemsets No way to control the number of itemsets Common in frequent ideal itemsets mining algorithms 7 total Stage 1 Stage 2 speedup Riondato et al., PARMA, CIKM' number of instances 72

73 Accuracy Output always ε-approximation to FI(D,θ) Output not much larger than real collection Better frequency accuracy than guaranteed Smaller confidence intervals than guaranteed Riondato et al., PARMA, CIKM'12

74 Conclusions We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules from a large dataset with minimum data replication and quasi-linear speedup 74 Riondato et al., PARMA, CIKM'12

75 Publications Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012 Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012

PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce

PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce Matteo Riondato Dept. of Computer Science Brown University Providence, RI, USA matteo@cs.brown.edu Justin A.

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

MATTEO RIONDATO Curriculum vitae

MATTEO RIONDATO Curriculum vitae MATTEO RIONDATO Curriculum vitae 100 Avenue of the Americas, 16 th Fl. New York, NY 10013, USA +1 646 292 6641 riondato@acm.org http://matteo.rionda.to EDUCATION Ph.D. Computer Science, Brown University,

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Abstract of Sampling-based Randomized Algorithms for Big Data Analytics by Matteo Riondato,

Abstract of Sampling-based Randomized Algorithms for Big Data Analytics by Matteo Riondato, Abstract of Sampling-based Randomized Algorithms for Big Data Analytics by Matteo Riondato, Ph.D., Brown University, May 2014. Analyzing huge datasets becomes prohibitively slow when the dataset does not

More information

Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster Sudhakar Singh Dept. of Computer Science Faculty of Science Banaras Hindu University Rakhi Garg Dept. of Computer

More information

Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

More information

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm Purpose: key concepts in mining frequent itemsets understand the Apriori algorithm run Apriori in Weka GUI and in programatic way 1 Theoretical

More information

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE International Journal of Computer Science and Applications, Vol. 5, No. 4, pp 57-69, 2008 Technomathematics Research Foundation PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

More information

B490 Mining the Big Data. 0 Introduction

B490 Mining the Big Data. 0 Introduction B490 Mining the Big Data 0 Introduction Qin Zhang 1-1 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 2-1 Data Mining What is Data Mining? A

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

Association Rules Apriori Algorithm. Machine Learning Overview Sales Transaction and Association Rules Aprori Algorithm Example

Association Rules Apriori Algorithm. Machine Learning Overview Sales Transaction and Association Rules Aprori Algorithm Example Association Rules Apriori Algorithm Machine Learning Overview Sales Transaction and Association Rules Aprori Algorithm Example 1 Machine Learning Common ground of presented methods Statistical Learning

More information

Clustering Algorithms. Data Mining Clustering. Distance. Example. More Than One Mean. Mean Clustering

Clustering Algorithms. Data Mining Clustering. Distance. Example. More Than One Mean. Mean Clustering Clustering Algorithms Data Mining Clustering Kevin Swingler Organise data into a number of distinct groups (clusters) according to the similarity of their members and their differences from other clusters

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus

More information

Apriori-Map/Reduce Algorithm

Apriori-Map/Reduce Algorithm Apriori-Map/Reduce Algorithm Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce algorithm has received highlights as cloud computing services

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20

Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20 Interactive Analytical Processing in Big Data Systems,BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking,Study about DataSet May 23, 2014 Interactive Analytical Processing in Big Data Systems,BDGS:

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Using Data Mining and Machine Learning in Retail

Using Data Mining and Machine Learning in Retail Using Data Mining and Machine Learning in Retail Omeid Seide Senior Manager, Big Data Solutions Sears Holdings Bharat Prasad Big Data Solution Architect Sears Holdings Over a Century of Innovation A Fortune

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

More information

Characterizing Task Usage Shapes in Google s Compute Clusters

Characterizing Task Usage Shapes in Google s Compute Clusters Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

R at the front end and

R at the front end and Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation

More information

Joint Optimization of Overlapping Phases in MapReduce

Joint Optimization of Overlapping Phases in MapReduce Joint Optimization of Overlapping Phases in MapReduce Minghong Lin, Li Zhang, Adam Wierman, Jian Tan Abstract MapReduce is a scalable parallel computing framework for big data processing. It exhibits multiple

More information

Management & Analysis of Big Data in Zenith Team

Management & Analysis of Big Data in Zenith Team Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM Outline Introduction to MapReduce Dealing with Data Skew in Big Data Processing Data Partitioning for MapReduce Frequent Sequence

More information

Data Mining Applications in Manufacturing

Data Mining Applications in Manufacturing Data Mining Applications in Manufacturing Dr Jenny Harding Senior Lecturer Wolfson School of Mechanical & Manufacturing Engineering, Loughborough University Identification of Knowledge - Context Intelligent

More information

Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

More information

Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

More information

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan

More information

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Selection of Optimal Discount of Retail Assortments with Data Mining Approach Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.

More information

Mining an Online Auctions Data Warehouse

Mining an Online Auctions Data Warehouse Proceedings of MASPLAS'02 The Mid-Atlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Case study: d60 Raptor smartadvisor. Jan Neerbek Alexandra Institute

Case study: d60 Raptor smartadvisor. Jan Neerbek Alexandra Institute Case study: d60 Raptor smartadvisor Jan Neerbek Alexandra Institute Agenda d60: A cloud/data mining case Cloud Data Mining Market Basket Analysis Large data sets Our solution 2 Alexandra Institute The

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Data Warehousing und Data Mining

Data Warehousing und Data Mining Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications

IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications Open System Laboratory of University of Illinois at Urbana Champaign presents: Outline: IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications A Fine-Grained Adaptive

More information

Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Improving Apriori Algorithm to get better performance with Cloud Computing

Improving Apriori Algorithm to get better performance with Cloud Computing Improving Apriori Algorithm to get better performance with Cloud Computing Zeba Qureshi 1 ; Sanjay Bansal 2 Affiliation: A.I.T.R, RGPV, India 1, A.I.T.R, RGPV, India 2 ABSTRACT Cloud computing has become

More information

Mining Interesting Medical Knowledge from Big Data

Mining Interesting Medical Knowledge from Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Project Report. 1. Application Scenario

Project Report. 1. Application Scenario Project Report In this report, we briefly introduce the application scenario of association rule mining, give details of apriori algorithm implementation and comment on the mined rules. Also some instructions

More information

Hadoop-BAM and SeqPig

Hadoop-BAM and SeqPig Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

Exploring HADOOP as a Platform for Distributed Association Rule Mining

Exploring HADOOP as a Platform for Distributed Association Rule Mining FUTURE COMPUTING 2013 : The Fifth International Conference on Future Computational Technologies and Applications Exploring HADOOP as a Platform for Distributed Association Rule Mining Shravanth Oruganti,

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Frequent Itemset Mining for Big Data

Frequent Itemset Mining for Big Data Frequent Itemset Mining for Big Data Sandy Moens, Emin Aksehirli and Bart Goethals Universiteit Antwerpen, Belgium Email: firstname.lastname@uantwerpen.be Abstract Frequent Itemset Mining (FIM) is one

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

New Matrix Approach to Improve Apriori Algorithm

New Matrix Approach to Improve Apriori Algorithm New Matrix Approach to Improve Apriori Algorithm A. Rehab H. Alwa, B. Anasuya V Patil Associate Prof., IT Faculty, Majan College-University College Muscat, Oman, rehab.alwan@majancolleg.edu.om Associate

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Table of Contents. June 2010

Table of Contents. June 2010 June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and

More information

DWMiner : A tool for mining frequent item sets efficiently in data warehouses

DWMiner : A tool for mining frequent item sets efficiently in data warehouses DWMiner : A tool for mining frequent item sets efficiently in data warehouses Bruno Kinder Almentero, Alexandre Gonçalves Evsukoff and Marta Mattoso COPPE/Federal University of Rio de Janeiro, P.O.Box

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

On the Cost of Mining Very Large Open Source Repositories

On the Cost of Mining Very Large Open Source Repositories On the Cost of Mining Very Large Open Source Repositories Sean Banerjee Carnegie Mellon University Bojan Cukic University of North Carolina at Charlotte BIGDSE, Florence 2015 Introduction Issue tracking

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational Simply Mining Data Jilles Vreeken So, how do you pronounce Exploratory Data Analysis Jilles Vreeken Jilles Yill less Vreeken Fray can 17 August 2015 Okay, now we can talk. 17 August 2015 The goal So, what

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

CAS CS 565, Data Mining

CAS CS 565, Data Mining CAS CS 565, Data Mining Course logistics Course webpage: http://www.cs.bu.edu/~evimaria/cs565-10.html Schedule: Mon Wed, 4-5:30 Instructor: Evimaria Terzi, evimaria@cs.bu.edu Office hours: Mon 2:30-4pm,

More information

Small Maximal Independent Sets and Faster Exact Graph Coloring

Small Maximal Independent Sets and Faster Exact Graph Coloring Small Maximal Independent Sets and Faster Exact Graph Coloring David Eppstein Univ. of California, Irvine Dept. of Information and Computer Science The Exact Graph Coloring Problem: Given an undirected

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014 Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce approach has been

More information

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA Jayalatchumy D 1, Thambidurai. P 2 Abstract Clustering is a process of grouping objects that are similar among themselves but dissimilar

More information

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München,

More information

Developing a MapReduce Application

Developing a MapReduce Application TIE 12206 - Apache Hadoop Tampere University of Technology, Finland November, 2014 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 MapReduce

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12 MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information