Efficient Parallel Set-Similarity Joins Using MapReduce

Size: px
Start display at page:

Download "Efficient Parallel Set-Similarity Joins Using MapReduce"

Transcription

1 Efficient Parallel Set-Similarity Joins Using MapReduce Rares Vernica Michael J. Carey Chen Li Department of Computer Science University of California, Irvine Special Interest Group on Management of Data, 2010 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

2 Outline 1 Motivation 2 Problem Statement 3 Algorithms 4 Experimental Evaluation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

3 Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales ID Name... S10 John W Smith.... Master customer data across two departments Customers ID Name... C30 John W Smith.... Returns ID Name... R20 Smith John.... Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

4 Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales ID Name... S10 John W Smith.... Master customer data across two departments Customers ID Name... C30 John W Smith.... Returns ID Name... R20 Smith John.... Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

5 Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales ID Name... S10 John W Smith.... Master customer data across two departments Customers ID Name... C30 John W Smith.... Returns ID Name... R20 Smith John.... Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

6 Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales ID Name... S10 John W Smith.... Master customer data across two departments Customers ID Name... C30 John W Smith.... Returns ID Name... R20 John W Smith.... Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

7 Example: Data Cleaning/Master-Data-Management String Set S10: John W Smith {John, Smith, W} R20: Smith John {John, Smith} Set-Similarity Metric Jaccard similarity/tanimoto coefficient: jaccard(x, y) = x y x y jaccard(s10, R20) = 2 3 Set-Similarity Join Set-Similarity Join Sales and Returns on Name Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

8 Example: Data Cleaning/Master-Data-Management String Set S10: John W Smith {John, Smith, W} R20: Smith John {John, Smith} Set-Similarity Metric Jaccard similarity/tanimoto coefficient: jaccard(x, y) = x y x y jaccard(s10, R20) = 2 3 Set-Similarity Join Set-Similarity Join Sales and Returns on Name Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

9 Example: Data Cleaning/Master-Data-Management String Set S10: John W Smith {John, Smith, W} R20: Smith John {John, Smith} Set-Similarity Metric Jaccard similarity/tanimoto coefficient: jaccard(x, y) = x y x y jaccard(s10, R20) = 2 3 Set-Similarity Join Set-Similarity Join Sales and Returns on Name Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

10 Example: Data Cleaning/Master-Data-Management String Set S10: John W Smith {John, Smith, W} R20: Smith John {John, Smith} Set-Similarity Metric Jaccard similarity/tanimoto coefficient: jaccard(x, y) = x y x y jaccard(s10, R20) = 2 3 Set-Similarity Join Set-Similarity Join Sales and Returns on Name Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

11 Example: Data Cleaning/Master-Data-Management String Set S10: John W Smith {John, Smith, W} R20: Smith John {John, Smith} Set-Similarity Metric Jaccard similarity/tanimoto coefficient: jaccard(x, y) = x y x y jaccard(s10, R20) = 2 3 Set-Similarity Join Set-Similarity Join Sales and Returns on Name Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

12 Example: Data Cleaning/Master-Data-Management String Set S10: John W Smith {John, Smith, W} R20: Smith John {John, Smith} Set-Similarity Metric Jaccard similarity/tanimoto coefficient: jaccard(x, y) = x y x y jaccard(s10, R20) = 2 3 Set-Similarity Join Set-Similarity Join Sales and Returns on Name Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

13 Example: Data Cleaning/Master-Data-Management String Set S10: John W Smith {John, Smith, W} R20: Smith John {John, Smith} Set-Similarity Metric Jaccard similarity/tanimoto coefficient: jaccard(x, y) = x y x y jaccard(s10, R20) = 2 3 Set-Similarity Join Set-Similarity Join Sales and Returns on Name Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

14 Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

15 Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

16 Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

17 Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

18 Problem Statement: Set-Similarity Join Input Two files of records e.g., R(RID, a, b) and S(RID, c, d) A join column on each file e.g., R.a and S.c A similarity function, sim e.g., Jaccard A similarity threshold, τ Output All pairs of records from R and S where sim(r.a, S.c) τ Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

19 Parallelizing Set-Similarity Joins Large amounts of data E.g., GeneBank: 100M, Google N-gram: 1T Data or processing does not fit in one machine Use a cluster of machines and a parallel algorithm MapReduce: shared-nothing data-processing platform Challenges Partition problem for parallelism Solve using Map, Sort, and Reduce Compute end-to-end set-similarity joins Deal with out-of-memory situations Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

20 Parallelizing Set-Similarity Joins Large amounts of data E.g., GeneBank: 100M, Google N-gram: 1T Data or processing does not fit in one machine Use a cluster of machines and a parallel algorithm MapReduce: shared-nothing data-processing platform Challenges Partition problem for parallelism Solve using Map, Sort, and Reduce Compute end-to-end set-similarity joins Deal with out-of-memory situations Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

21 MapReduce Review map (k1,v1) list(k2,v2); reduce (k2,list(v2)) list(k3,v3). combine (k2,list(v2)) list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

22 MapReduce Review map (k1,v1) list(k2,v2); reduce (k2,list(v2)) list(k3,v3). combine (k2,list(v2)) list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

23 MapReduce Review map (k1,v1) list(k2,v2); reduce (k2,list(v2)) list(k3,v3). combine (k2,list(v2)) list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

24 Parallel Set-Similarity Joins in MapReduce Main idea Hash-partition data across the network based on keys Join values cannot be directly used as keys Generate signatures from join values and use them as keys e.g., John W Smith {John, Smith, W} Three Stages 1 Compute set-element statistics Records Statistics 2 Generate similar record-id ( RID ) pairs {Records, Statistics} RID-pairs 3 Generate pairs of joined records {Records, RID-pairs} Record-pairs Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

25 Parallel Set-Similarity Joins in MapReduce Main idea Hash-partition data across the network based on keys Join values cannot be directly used as keys Generate signatures from join values and use them as keys e.g., John W Smith {John, Smith, W} Three Stages 1 Compute set-element statistics Records Statistics 2 Generate similar record-id ( RID ) pairs {Records, Statistics} RID-pairs 3 Generate pairs of joined records {Records, RID-pairs} Record-pairs Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

26 Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

27 Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

28 Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

29 Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

30 Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

31 Processing Stages and Alternatives Stage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO) Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK) Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

32 Processing Stages and Alternatives Stage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO) Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK) Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

33 Processing Stages and Alternatives Stage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO) Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK) Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

34 Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

35 Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

36 Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

37 Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

38 Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

39 Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

40 Additional Contributions Solve R-S-join case Optimize memory requirements for R-S join Handle insufficient memory Introduce additional filters Use external memory Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

41 Experimental Setting Hardware 10-node IBM x3650 cluster Intel Xeon processor E GHz with four cores Four 300GB hard disks 12GB RAM Software Ubuntu 9.06, 64-bit, server edition OS Java 1.6, 64-bit, server Hadoop Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

42 Experimental Setting Datasets DBLP Average length: 259 bytes Number of records: 1.2M Total size: 300MB CITESEERX Average length: 1374 bytes Number of records: 1.3M Total size: 1.8GB Increased each up to 25, preserving join properties DBLP: 31M records, 8.2GB CITESEERX: 32M records, 45GB Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

43 Running Time Self-join DBLP n n [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

44 Running Time Self-join DBLP n n [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

45 Running Time Self-join DBLP n n [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

46 Speedup Speedup BTO-BK-BRJ BTO-PK-BRJ BTO-PK-OPRJ Ideal Relative running time Self-join DBLP 10 Different cluster sizes # Nodes Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

47 Speedup Breakdown Stage 1 Stage 2 Stage BTO OPTO Ideal 4 BK PK Ideal 4 BRJ OPRJ Ideal Speedup 3 Speedup 3 Speedup # Nodes # Nodes Relative running time Self-join DBLP 10 Different cluster sizes # Nodes Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

48 Scaleup Time (seconds) BTO-BK-BRJ BTO-PK-BRJ BTO-PK-OPRJ Ideal # Nodes and Dataset Size Running time Self-joining DBLP n n [5, 25] Proportional cluster Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

49 Scaleup Breakdown Stage 1 Stage 2 Stage 3 Time (seconds) BTO OPTO Ideal # Nodes and Dataset Size Running time Time (seconds) Self-joining DBLP n, n [5, 25] Proportional cluster BK PK Ideal # Nodes and Dataset Size Time (seconds) BRJ OPRJ Ideal # Nodes and Dataset Size Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

50 Summary Set-similarity joins in MapReduce Three-stage approach Balance workload and minimize replication End-to-end algorithms Self-join R-S join Memory issues Optimize memory requirements Handle insufficient memory Experiments 40 cores, 40 disks cluster Speedup and scaleup Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

51 More Info A longer version of the paper, the source-code and the datasets: This work is part of the ASTERIX and Flamingo projects at UC Irvine. Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

52 Stage 1: Basic Token Ordering (BTO) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

53 Stage 1: One Phase Token Ordering (OPTO) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

54 Stage 2: Basic Kernel (BK) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

55 Stage 2: Indexed Kernel (PK) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

56 Stage 3: Basic Record Join (BRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

57 Stage 3: One Phase Record Join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD / 22

Large-Scale Data Cleaning Using Hadoop. UC Irvine

Large-Scale Data Cleaning Using Hadoop. UC Irvine Chen Li UC Irvine Joint work with Michael Carey, Alexander Behm, Shengyue Ji, Rares Vernica 1 Overview Importance of information Importance of information quality Data cleaning Large scale Hadoop 2 Data

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and

More information

Use of Hadoop File System for Nuclear Physics Analyses in STAR

Use of Hadoop File System for Nuclear Physics Analyses in STAR 1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources

More information

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004)

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004) MapReduce from the paper MapReduce: Simplified Data Processing on Large Clusters (2004) What it is MapReduce is a programming model and an associated implementation for processing and generating large

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

Autodesk Inventor on the Macintosh

Autodesk Inventor on the Macintosh Autodesk Inventor on the Macintosh FREQUENTLY ASKED QUESTIONS 1. Can I install Autodesk Inventor on a Mac? 2. What is Boot Camp? 3. What is Parallels? 4. How does Boot Camp differ from Virtualization?

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment Sanjay Kulhari, Jian Wen UC Riverside Team Sanjay Kulhari M.S. student, CS U C Riverside Jian Wen Ph.D. student, CS U

More information

Learning-based Entity Resolution with MapReduce

Learning-based Entity Resolution with MapReduce Learning-based Entity Resolution with MapReduce Lars Kolb 1 Hanna Köpcke 2 Andreas Thor 1 Erhard Rahm 1,2 1 Database Group, 2 WDI Lab University of Leipzig {kolb,koepcke,thor,rahm}@informatik.uni-leipzig.de

More information

A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Hadoop Cluster Applications

Hadoop Cluster Applications Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April 2009. Page 1 of 12

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April 2009. Page 1 of 12 XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines A.Zydroń 18 April 2009 Page 1 of 12 1. Introduction...3 2. XTM Database...4 3. JVM and Tomcat considerations...5 4. XTM Engine...5

More information

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031) using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Study of Load Balancing of Resource Namespace Service

Study of Load Balancing of Resource Namespace Service Study of Load Balancing of Resource Namespace Service Masahiro Nakamura, Osamu Tatebe University of Tsukuba Background Resource Namespace Service (RNS) is published as GDF.101 by OGF RNS is intended to

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Table 1. Requirements for Domain Controller. You will need a Microsoft Active Directory domain. Microsoft SQL Server. SQL Server Reporting Services

Table 1. Requirements for Domain Controller. You will need a Microsoft Active Directory domain. Microsoft SQL Server. SQL Server Reporting Services The following tables provide the various requirements and recommendations needed to optimize JustWare system performance. Note that the specifications provided in this section are only minimum specifications.

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys

A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys Proceedings of the Twelfth Australasian Symposium on Parallel and Distributed Computing (AusPDC 0), Auckland, New Zealand A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

R at the front end and

R at the front end and Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

Main Memory Data Warehouses

Main Memory Data Warehouses Main Memory Data Warehouses Robert Wrembel Poznan University of Technology Institute of Computing Science Robert.Wrembel@cs.put.poznan.pl www.cs.put.poznan.pl/rwrembel Lecture outline Teradata Data Warehouse

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

Redundant Data Removal Technique for Efficient Big Data Search Processing

Redundant Data Removal Technique for Efficient Big Data Search Processing Redundant Data Removal Technique for Efficient Big Data Search Processing Seungwoo Jeon 1, Bonghee Hong 1, Joonho Kwon 2, Yoon-sik Kwak 3 and Seok-il Song 3 1 Dept. of Computer Engineering, Pusan National

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Rethinking SIMD Vectorization for In-Memory Databases

Rethinking SIMD Vectorization for In-Memory Databases SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest

More information

2015 The MathWorks, Inc. 1

2015 The MathWorks, Inc. 1 25 The MathWorks, Inc. 빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2 Challenges of Data Any collection of data sets so large and complex

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13 External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

EFFICIENT GEAR-SHIFTING FOR A POWER-PROPORTIONAL DISTRIBUTED DATA-PLACEMENT METHOD

EFFICIENT GEAR-SHIFTING FOR A POWER-PROPORTIONAL DISTRIBUTED DATA-PLACEMENT METHOD EFFICIENT GEAR-SHIFTING FOR A POWER-PROPORTIONAL DISTRIBUTED DATA-PLACEMENT METHOD 2014/1/27 Hieu Hanh Le, Satoshi Hikida and Haruo Yokota Tokyo Institute of Technology 1.1 Background 2 Commodity-based

More information

BIG DATA USING HADOOP

BIG DATA USING HADOOP + Breakaway Session By Johnson Iyilade, Ph.D. University of Saskatchewan, Canada 23-July, 2015 BIG DATA USING HADOOP + Outline n Framing the Problem Hadoop Solves n Meet Hadoop n Storage with HDFS n Data

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Viswanath Nandigam Sriram Krishnan Chaitan Baru

Viswanath Nandigam Sriram Krishnan Chaitan Baru Viswanath Nandigam Sriram Krishnan Chaitan Baru Traditional Database Implementations for large-scale spatial data Data Partitioning Spatial Extensions Pros and Cons Cloud Computing Introduction Relevance

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Yuji Shirasaki (JVO NAOJ)

Yuji Shirasaki (JVO NAOJ) Yuji Shirasaki (JVO NAOJ) A big table : 20 billions of photometric data from various survey SDSS, TWOMASS, USNO-b1.0,GSC2.3,Rosat, UKIDSS, SDS(Subaru Deep Survey), VVDS (VLT), GDDS (Gemini), RXTE, GOODS,

More information

MapReduce and Hadoop Distributed File System V I J A Y R A O

MapReduce and Hadoop Distributed File System V I J A Y R A O MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2 CPU Memory Machine Learning, Statistics Classical Data Mining Disk 3 20+ billion web pages x 20KB = 400+ TB

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Cloud Computing Where ISR Data Will Go for Exploitation

Cloud Computing Where ISR Data Will Go for Exploitation Cloud Computing Where ISR Data Will Go for Exploitation 22 September 2009 Albert Reuther, Jeremy Kepner, Peter Michaleas, William Smith This work is sponsored by the Department of the Air Force under Air

More information

Autodesk 3ds Max 2010 Boot Camp FAQ

Autodesk 3ds Max 2010 Boot Camp FAQ Autodesk 3ds Max 2010 Boot Camp Frequently Asked Questions (FAQ) Frequently Asked Questions and Answers This document provides questions and answers about using Autodesk 3ds Max 2010 software with the

More information

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large

More information

Very Large Enterprise Network, Deployment, 25000+ Users

Very Large Enterprise Network, Deployment, 25000+ Users Very Large Enterprise Network, Deployment, 25000+ Users Websense software can be deployed in different configurations, depending on the size and characteristics of the network, and the organization s filtering

More information

How To Fix A Fault Notification On A Network Security Platform 8.0.0 (Xc) (Xcus) (Network) (Networks) (Manual) (Manager) (Powerpoint) (Cisco) (Permanent

How To Fix A Fault Notification On A Network Security Platform 8.0.0 (Xc) (Xcus) (Network) (Networks) (Manual) (Manager) (Powerpoint) (Cisco) (Permanent XC-Cluster Release Notes Network Security Platform 8.0 Revision A Contents About this document New features Resolved issues Known issues Installation instructions Product documentation About this document

More information

System requirements for A+

System requirements for A+ System requirements for A+ Anywhere Learning System: System Requirements Customer-hosted Browser Version Web-based ALS (WBA) Delivery Network Requirements In order to configure WBA+ to properly answer

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Hardware Performance Optimization and Tuning. Presenter: Tom Arakelian Assistant: Guy Ingalls

Hardware Performance Optimization and Tuning. Presenter: Tom Arakelian Assistant: Guy Ingalls Hardware Performance Optimization and Tuning Presenter: Tom Arakelian Assistant: Guy Ingalls Agenda Server Performance Server Reliability Why we need Performance Monitoring How to optimize server performance

More information

Affinity Aware VM Colocation Mechanism for Cloud

Affinity Aware VM Colocation Mechanism for Cloud Affinity Aware VM Colocation Mechanism for Cloud Nilesh Pachorkar 1* and Rajesh Ingle 2 Received: 24-December-2014; Revised: 12-January-2015; Accepted: 12-January-2015 2014 ACCENTS Abstract The most of

More information

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,

More information

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election

More information

Copyright 1999-2011 by Parallels Holdings, Ltd. All rights reserved.

Copyright 1999-2011 by Parallels Holdings, Ltd. All rights reserved. Parallels Virtuozzo Containers 4.0 for Linux Readme Copyright 1999-2011 by Parallels Holdings, Ltd. All rights reserved. This document provides the first-priority information on Parallels Virtuozzo Containers

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Can the Elephants Handle the NoSQL Onslaught?

Can the Elephants Handle the NoSQL Onslaught? Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented

More information