Statistical Learning Theory Meets Big Data


 Rosanna Sutton
 2 years ago
 Views:
Transcription
1 Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University
2 Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician, (attributed) Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009
3 Big Data Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009
4 Big Data Big data requires big storage and long execution time It's great to have big data but do we need to process all of it? Trade off accuracy of results for execution speed Can approximation algorithms for data analysis be fast and still guarantee highquality results?
5 Frequent Itemsets Market Basket Analysis You own a grocery store Have copy of receipts from sales For each customer, you have the list of products she bought Interested in what groups of products are bought together correlated items Business decisions Optimization
6 Settings Transactional dataset D Figure from Tan et al.  Introduction to Data Mining
7 Transactions Transactional dataset D transaction Figure from Tan et al.  Introduction to Data Mining
8 Items Transactions are built on items from a domain items Figure from Tan et al.  Introduction to Data Mining
9 Itemsets Sets of items are called itemsets Itemsets Figure from Tan et al.  Introduction to Data Mining
10 Frequency Frequency of an itemset X in dataset D: : fraction of transactions of D containing X Example: Milk: 4/5, {Bread, Milk}: 3/5 Figure from Tan et al.  Introduction to Data Mining
11 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency
12 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency Top K Frequent Itemsets Find all itemsets at least as frequent as the kth most frequent Association Rules X in a transaction is correlated with Y in the transaction
13 Frequent Itemsets mining Classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]
14 Frequent Itemsets mining Well studied classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00] Running time depends on number of transactions in the dataset and on the number of frequent itemsets 10^8 transactions considered normal size 10^4 items 2^(10^4) itemsets total A transaction with d items contains 2^d itemsets
15 How to speed up mining? Decrease dependency on number of transactions:
16 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower
17 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable
18 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1
19 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must be in W
20 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W Must be in W
21 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W May be in W Must be in W
22 Observa(on Data mining is exploratory in nature Exact results are usually not needed We want a good, realistic idea of the process Fast analysis is more important... if it still guarantees goodenough results! 22 Riondato et al., PARMA, CIKM'12
23 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of itemset from sample, such that: Frequency 0 1 Must not be in W May be in W Must be in W Problem: choose right sample size and right
24 Choosing If we have, for all itemsets X simultaneously then does the trick
25 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W
26 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W Need sample size S s.t., with prob. at least, for all itemsets X, we have
27 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that
28 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that Problem: There's an exponential number of itemsets Sample size would too large (linear the in number of elements)
29 VapnikChervonenkis Dimension Combinatorial property of a collection of subsets from a domain Measures the richness or expressivity of the collection of subsets If we know the VCdim of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample
30 Range spaces VCDimension is defined on range spaces (B,R): range space B: domain R: collection of subsets from B (ranges) No restrictions: B can be infinite R can be infinite R can contain infinitelylarge subsets of B
31 VapnikChervonenkis Dimension Range space For any, define C is shattered if The VCDimension of (B,R) is the size of the largest shattered subset of B
32 Example of VCDimension y B = x
33 Example of VCDimension y B = R = all axisaligned rectangles in x
34 Example of VCDimension y B = R = all axisaligned rectangles in x
35 Example of VCDimension y B = R = all axisaligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned
36 Example of VCDimension y B = R = all axisaligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y x
37 Example of VCDimension y B = R = all axisaligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y Need 16 rectangles x
38 Example of VCDimension y B = R = all axisaligned rectangles in x Shattering 5 points?
39 Example of VCDimension y B = R = all axisaligned rectangles in x Shattering 5 points: impossible
40 Example of VCDimension y B = R = all axisaligned rectangles in x Shattering 5 points: impossible Take any 5 points One of them that is contained in all rectangles containing the other four Impossible to find a rectangle containing only the other four VC(B,R)=4
41 Approximating sizes of ranges Sample Theorem: Let (B,R) have VC(B,R) d. Given, let S be a collection of points from B sampled uniformly at random. If then, Can approximate sizes of all simultaneously No need of union bound
42 VCDimension for Frequent Itemsets B = dataset D (set of transactions)
43 VCDimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X
44 VCDimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X
45 VCDimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X
46 VCDimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X
47 VCDimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X To approximate using the sample theorem, need upper bound to VC(B,R)
48 Upper Bound to VCDim for FI's B = dataset D (set of transactions) Itemset X, = transactions of D containing X Theorem: [The dindex of a dataset] Let d be the maximum integer such that D contains at least d transactions of length at least d. Then VC(B,R) d
49 The d index of the dataset dindex d: the maximum integer such that D contains at least d transactions of length at least d Example: this dataset has dindex d=3 bread beer milk coffee chips coke pasta bread coke chips milk coffee pasta milk Independent of D Can be computed with a single scan of the dataset d is an upper bound to the VCdimension 49
50 Choosing the right sample size Theorem Let Let D be a dataset and d be the max integer such that D contains at least d transactions of length at least d Let S be a collection of transactions of D sampled independently and uniformly at random, with Then
51 Algorithm To extract approximate collection of freq itemsets: 1.Compute d for the dataset D 2.Compute sample size (function of d) 3.Create random sample S from D 4.Extract Freq Itemsets from S using
52 Algorithm To extract approximate collection of Freq. Itemsets: 1.Compute d for the dataset D 2.Compute sample size given d 3.Create random sample S from D 4.Extract Frequent Itemsets from S using Theorem: With probability at least, the returned set W of itemsets satisfies the desired property Frequency 0 1 Must not be in W May be in W Must be in W
53 A closer look... Expression of sample size:
54 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d
55 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets
56 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets The time to mine S does not depend on D!!!
57 Proof (Intuition) For a set of k transactions to be shattered, each transaction must appear in 2^(k1) different 's where X is an itemset A transaction only appears in the 's of the itemsets X it contains A transaction of length w contains 2^(w)1 itemsets Need w k for the transaction to belong to a shattered set of size k To shatter k transactions they must all have length k Max k for which it happens is upper bound to VCDim
58 Experiments Sample always fits into main memory Output always satisfies required approx. guarantees Frequency accuracy even better than guaranteed Mining time significantly improved
59 Conclusions We showed how VCdimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through samp Riondato & Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML/PKDD (1)
60 PARMA: A Randomized Algorithm for Approximate Associa;on Rules Mining in MapReduce Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal
61 Goals Extract high quality approximation of FI(D,θ) by mining several small samples of D in parallel Adapt to available computational resources Minimize data replication and communication Achieve high parallelism 61 Riondato et al., PARMA, CIKM'12
62 Why parallelizing? For a very high quality approximation, the sample would still be large Sample creation time depends on size of dataset Parallelization allows us to achieve: higher accuracy (lower ε) higher confidence (lower δ) faster results Makes sense when dataset is stored on a distributed filesystem 62 Riondato et al., PARMA, CIKM'12
63 MapReduce framework for easy parallelization on large clusters of commodity machines Allows very fast analysis of large datasets Algorithms expressed through two functions: Map: (k,v) (k1,v1),(k2,v2),... Reduce: (r,(w1,w2,...)) (k1,v1),(k2,v2),... It is expensive to move data around (shuffling) Design goal: avoid data replication 63 Riondato et al., PARMA, CIKM'12
64 The op'miza'on framework Sample should fit into local memory No more samples than number of processors Exploit available computational resources to maximize quality of approximation Constraints and Objective function expressed in a Mixed Integer Non Linear Program Convex feasibility region and convex obj. function Easy and fast to solve with global MINLP solver 64 Riondato et al., PARMA, CIKM'12
65 The Algorithm: PARMA Very simple design 2 MapReduce Rounds 1 st round Map: create samples Reduce: mine each sample 2 nd round Map: identity function Reduce: aggregation/filtering of results 65 Riondato et al., PARMA, CIKM'12
66 1 st Round: Sampling and Mining Parallelization of the sequential algorithm Map: Sample Creation Random sampling with replacement... in parallel! Input: (transaction id, transaction) Output: (sample id, transaction) Reduce: Mining of each sample in parallel using lowered frequency threshold θε/2 Can use any sequential exact mining algorithm (APriori, FPGrowth,...) Input: (sample id, (transaction1, transaction2,...)) Output: (itemset, (frequency, confidence interval)) 66 Riondato et al., PARMA, CIKM'12
67 2 nd Phase: Aggrega:on Map: Identity function Reduce: Filtering and Aggregation Input: (itemset,((freq1,conf_int1),(freq2,conf_int2),...) Output: (itemset, (frequency, confidence interval)) An itemset is sent in output only if it was frequent in a sufficient number of samples (determined by the optimization framework) Confidence interval: smallest interval containing a sufficient number of estimations freq1,freq2,... Frequency: median of confidence interval 67 Riondato et al., PARMA, CIKM'12
68 Analysis Theorem: With probability at least 1δ, output is an εapproximation to FI(D,θ) Proof intuition: Boostinglike approach Outputs of the reducers in stage 1 are εapproximations for the local samples, with some probability 1φ If an itemset appears in enough εapproximations, then it is probably frequent in the dataset Consequences: better approximation, fewer false positives 68 Riondato et al., PARMA, CIKM'12
69 Implementa:on PARMA implemented as Java library for Hadoop Easy to run, possible integration with Apache Mahout FPGrowth used for the sample mining phase PARMA is agnostic to the choice of mining algorithm We used an existing Java implementation 1400 lines of code Available on github 69 Riondato et al., PARMA, CIKM'12
70 Experiments Run on Amazon Elastic MapReduce Up to 16 machines, with 17GB of RAM Artificial datasets using standard benchmark generators Various parameters (transaction size, items,...) Size between 5 to 50 million transactions Compared performances of PARMA with PFP and naive distributed counting PARMA performs significantly better (1.5x to 5x speed improvement) in all tests 70 Riondato et al., PARMA, CIKM'12
71 Runme Analysis Runtime does not grow with size of data Because individual sample size does not grow! Much faster than PFP, faster as data grows! PARMA gets faster with more machines, even with more data. 71 Riondato et al., PARMA, CIKM'12
72 Speedup Quasi linear speedup Stage 1 very close to optimal Stage 2 no speedup Running time depends on number of itemsets No way to control the number of itemsets Common in frequent ideal itemsets mining algorithms 7 total Stage 1 Stage 2 speedup Riondato et al., PARMA, CIKM' number of instances 72
73 Accuracy Output always εapproximation to FI(D,θ) Output not much larger than real collection Better frequency accuracy than guaranteed Smaller confidence intervals than guaranteed Riondato et al., PARMA, CIKM'12
74 Conclusions We showed how to efficiently use MapReduce to mine a highquality approximation of the frequent itemsets and association rules from a large dataset with minimum data replication and quasilinear speedup 74 Riondato et al., PARMA, CIKM'12
75 Publications Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012 Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012
PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce Matteo Riondato Dept. of Computer Science Brown University Providence, RI, USA matteo@cs.brown.edu Justin A.
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationMATTEO RIONDATO Curriculum vitae
MATTEO RIONDATO Curriculum vitae 100 Avenue of the Americas, 16 th Fl. New York, NY 10013, USA +1 646 292 6641 riondato@acm.org http://matteo.rionda.to EDUCATION Ph.D. Computer Science, Brown University,
More informationAnalytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
More informationAbstract of Samplingbased Randomized Algorithms for Big Data Analytics by Matteo Riondato,
Abstract of Samplingbased Randomized Algorithms for Big Data Analytics by Matteo Riondato, Ph.D., Brown University, May 2014. Analyzing huge datasets becomes prohibitively slow when the dataset does not
More informationPerformance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster
Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster Sudhakar Singh Dept. of Computer Science Faculty of Science Banaras Hindu University Rakhi Garg Dept. of Computer
More informationDistributed Apriori in Hadoop MapReduce Framework
Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing
More informationLaboratory Module 8 Mining Frequent Itemsets Apriori Algorithm
Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm Purpose: key concepts in mining frequent itemsets understand the Apriori algorithm run Apriori in Weka GUI and in programatic way 1 Theoretical
More informationPREDICTIVE MODELING OF INTERTRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
International Journal of Computer Science and Applications, Vol. 5, No. 4, pp 5769, 2008 Technomathematics Research Foundation PREDICTIVE MODELING OF INTERTRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
More informationB490 Mining the Big Data. 0 Introduction
B490 Mining the Big Data 0 Introduction Qin Zhang 11 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 21 Data Mining What is Data Mining? A
More informationCSEE5430 Scalable Cloud Computing Lecture 2
CSEE5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.92015 1/36 Google MapReduce A scalable batch processing
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationAssociation Rules Apriori Algorithm. Machine Learning Overview Sales Transaction and Association Rules Aprori Algorithm Example
Association Rules Apriori Algorithm Machine Learning Overview Sales Transaction and Association Rules Aprori Algorithm Example 1 Machine Learning Common ground of presented methods Statistical Learning
More informationClustering Algorithms. Data Mining Clustering. Distance. Example. More Than One Mean. Mean Clustering
Clustering Algorithms Data Mining Clustering Kevin Swingler Organise data into a number of distinct groups (clusters) according to the similarity of their members and their differences from other clusters
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationIntroduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.
Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus
More informationAprioriMap/Reduce Algorithm
AprioriMap/Reduce Algorithm Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce algorithm has received highlights as cloud computing services
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationMammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
More informationInteractive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20
Interactive Analytical Processing in Big Data Systems,BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking,Study about DataSet May 23, 2014 Interactive Analytical Processing in Big Data Systems,BDGS:
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Largescale Computation Traditional solutions for computing large quantities of data
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationBig Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
More informationA Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 461 Komaba, Meguroku, Tokyo 1538505, Japan Abstract.
More informationUsing Data Mining and Machine Learning in Retail
Using Data Mining and Machine Learning in Retail Omeid Seide Senior Manager, Big Data Solutions Sears Holdings Bharat Prasad Big Data Solution Architect Sears Holdings Over a Century of Innovation A Fortune
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationLoad Balancing in MapReduce Based on Scalable Cardinality Estimates
Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching
More informationCharacterizing Task Usage Shapes in Google s Compute Clusters
Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key
More informationKNIME TUTORIAL. Anna Monreale KDDLab, University of Pisa Email: annam@di.unipi.it
KNIME TUTORIAL Anna Monreale KDDLab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationR at the front end and
Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation
More informationJoint Optimization of Overlapping Phases in MapReduce
Joint Optimization of Overlapping Phases in MapReduce Minghong Lin, Li Zhang, Adam Wierman, Jian Tan Abstract MapReduce is a scalable parallel computing framework for big data processing. It exhibits multiple
More informationManagement & Analysis of Big Data in Zenith Team
Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM Outline Introduction to MapReduce Dealing with Data Skew in Big Data Processing Data Partitioning for MapReduce Frequent Sequence
More informationData Mining Applications in Manufacturing
Data Mining Applications in Manufacturing Dr Jenny Harding Senior Lecturer Wolfson School of Mechanical & Manufacturing Engineering, Loughborough University Identification of Knowledge  Context Intelligent
More informationChapter 6: Episode discovery process
Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing
More informationBig Data with Rough Set Using Map Reduce
Big Data with Rough Set Using Map Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
More informationMINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan
More informationSelection of Optimal Discount of Retail Assortments with Data Mining Approach
Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.
More informationMining an Online Auctions Data Warehouse
Proceedings of MASPLAS'02 The MidAtlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance
More informationKnowledge Discovery and Data Mining. Structured vs. NonStructured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. NonStructured Data Most business databases contain structured data consisting of welldefined fields with numeric or alphanumeric values.
More informationCase study: d60 Raptor smartadvisor. Jan Neerbek Alexandra Institute
Case study: d60 Raptor smartadvisor Jan Neerbek Alexandra Institute Agenda d60: A cloud/data mining case Cloud Data Mining Market Basket Analysis Large data sets Our solution 2 Alexandra Institute The
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationData Warehousing und Data Mining
Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing GridFiles Kdtrees Ulf Leser: Data
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationIMCM: A Flexible FineGrained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications
Open System Laboratory of University of Illinois at Urbana Champaign presents: Outline: IMCM: A Flexible FineGrained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications A FineGrained Adaptive
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a threelayer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationBig Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationImproving Apriori Algorithm to get better performance with Cloud Computing
Improving Apriori Algorithm to get better performance with Cloud Computing Zeba Qureshi 1 ; Sanjay Bansal 2 Affiliation: A.I.T.R, RGPV, India 1, A.I.T.R, RGPV, India 2 ABSTRACT Cloud computing has become
More informationMining Interesting Medical Knowledge from Big Data
IOSR Journal of Computer Engineering (IOSRJCE) eissn: 22780661,pISSN: 22788727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 0610 www.iosrjournals.org Mining Interesting Medical Knowledge from
More informationLargeScale Test Mining
LargeScale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
More informationBig Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel
Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationChapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.dbbook.com for conditions on reuse Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
More informationProject Report. 1. Application Scenario
Project Report In this report, we briefly introduce the application scenario of association rule mining, give details of apriori algorithm implementation and comment on the mined rules. Also some instructions
More informationHadoopBAM and SeqPig
HadoopBAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically nonenterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationExploring HADOOP as a Platform for Distributed Association Rule Mining
FUTURE COMPUTING 2013 : The Fifth International Conference on Future Computational Technologies and Applications Exploring HADOOP as a Platform for Distributed Association Rule Mining Shravanth Oruganti,
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationFrequent Itemset Mining for Big Data
Frequent Itemset Mining for Big Data Sandy Moens, Emin Aksehirli and Bart Goethals Universiteit Antwerpen, Belgium Email: firstname.lastname@uantwerpen.be Abstract Frequent Itemset Mining (FIM) is one
More informationOpen source Googlestyle large scale data analysis with Hadoop
Open source Googlestyle large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationNew Matrix Approach to Improve Apriori Algorithm
New Matrix Approach to Improve Apriori Algorithm A. Rehab H. Alwa, B. Anasuya V Patil Associate Prof., IT Faculty, Majan CollegeUniversity College Muscat, Oman, rehab.alwan@majancolleg.edu.om Associate
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationTable of Contents. June 2010
June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multicore 64bit machines with current 64bit releases of SAS (Version 9.2) and
More informationDWMiner : A tool for mining frequent item sets efficiently in data warehouses
DWMiner : A tool for mining frequent item sets efficiently in data warehouses Bruno Kinder Almentero, Alexandre Gonçalves Evsukoff and Marta Mattoso COPPE/Federal University of Rio de Janeiro, P.O.Box
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationAdvanced InDatabase Analytics
Advanced InDatabase Analytics Tallinn, Sept. 25th, 2012 MikkoPekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
More informationOn the Cost of Mining Very Large Open Source Repositories
On the Cost of Mining Very Large Open Source Repositories Sean Banerjee Carnegie Mellon University Bojan Cukic University of North Carolina at Charlotte BIGDSE, Florence 2015 Introduction Issue tracking
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationSo, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multirelational
Simply Mining Data Jilles Vreeken So, how do you pronounce Exploratory Data Analysis Jilles Vreeken Jilles Yill less Vreeken Fray can 17 August 2015 Okay, now we can talk. 17 August 2015 The goal So, what
More informationKeywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
More informationCAS CS 565, Data Mining
CAS CS 565, Data Mining Course logistics Course webpage: http://www.cs.bu.edu/~evimaria/cs56510.html Schedule: Mon Wed, 45:30 Instructor: Evimaria Terzi, evimaria@cs.bu.edu Office hours: Mon 2:304pm,
More informationSmall Maximal Independent Sets and Faster Exact Graph Coloring
Small Maximal Independent Sets and Faster Exact Graph Coloring David Eppstein Univ. of California, Irvine Dept. of Information and Computer Science The Exact Graph Coloring Problem: Given an undirected
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationParallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
More informationStatistics for BIG data
Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSABulletine 2014) Before
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationUsing InMemory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using InMemory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
More informationEvaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of WisconsinMadison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
More informationHadoop MapReduce and Spark. Giorgio Pedrazzi, CINECASCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECASCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
More informationMarket Basket Analysis Algorithm with Map/Reduce of Cloud Computing
Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing Jongwook Woo Computer Information Systems Department California State University Los Angeles, CA Abstract Map/Reduce approach has been
More informationIMPLEMENTATION OF PPIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA
IMPLEMENTATION OF PPIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA Jayalatchumy D 1, Thambidurai. P 2 Abstract Clustering is a process of grouping objects that are similar among themselves but dissimilar
More informationLocalitySensitive Operators for Parallel MainMemory Database Clusters
LocalitySensitive Operators for Parallel MainMemory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München,
More informationDeveloping a MapReduce Application
TIE 12206  Apache Hadoop Tampere University of Technology, Finland November, 2014 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 MapReduce
More informationBig Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationBig Data Technology MapReduce Motivation: Indexing in Search Engines
Big Data Technology MapReduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationComparision of kmeans and kmedoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of kmeans and kmedoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
More information