Processing Joins over Big Data in MapReduce
1 Processing Joins over Big Data in MapReduce Christos Doulkeridis, Department of Digital Systems, School of Information and Communication Technologies, University of Piraeus. 12th Hellenic Data Management Symposium (HDMS 14), 24 July 2014
2 Aims and Scope Provide an overview of join processing in MapReduce/Hadoop, focusing on complex join types besides equi-joins (e.g. theta-join, similarity join, top-k join, k-nn join), on top of MapReduce, for binary joins. Identify common techniques (at a high level) that can be used as building blocks when designing a parallel join algorithm 2
3 Not Covered in this Tutorial Distributed platforms for Big Data management other than Hadoop (e.g. Storm, NoSQL stores, ...) Approaches that rely on changing the internal mechanism of Hadoop (e.g. changing data layouts, loop-aware Hadoop, etc.) Multi-way joins (e.g. check out the work by Afrati et al.) Check out related tutorials MapReduce Algorithms for Big Data Analytics [K.Shim@VLDB12] Efficient Big Data Processing in Hadoop MapReduce [J.Dittrich@VLDB12] 3
4 Related Research Projects CloudIX: Cloud-based Indexing and Query Processing Funding program: MC-IEF, Roadrunner: Scalable and Efficient Analytics for Big Data Funding program: ΑΡΙΣΤΕΙΑ ΙΙ, ΓΓΕΤ, Open positions: Post-doc, Phd/Msc 4
5 MapReduce basics Outline Map-side and Reduce-side join Join types and algorithms Equi-joins Theta-joins Similarity joins Top-k joins knn joins Important phases of join algorithms Pre-processing, pre-filtering, partitioning, replication, load balancing Summary & open research directions 5
6 MapReduce Basics 6
7 MapReduce vs. RDBMS Source: T.White Hadoop: The Definitive Guide 7
8 Hadoop Distributed File System (HDFS) 8 Source:
9 MapReduce A programming framework for parallel and scalable processing of huge datasets. A job in MapReduce is defined as two separate phases: the function Map (also called mapper) and the function Reduce (also called reducer). Data is represented as key-value pairs: Map: (k1, v1) → list(k2, v2); Reduce: (k2, list(v2)) → list(k3, v3). The data output by the Map phase are partitioned and sort-merged in order to reach the machines that compute the Reduce phase; this intermediate phase is called Shuffle. 9
10 [Word-count example: mappers read five documents (Doc1: Brazil, Germany, Argentine; Doc2: Chile, Germany, Brazil; Doc3: Greece, Japan, Brazil; Doc4: Chile, Germany; Doc5: Chile, Brazil, Germany) and emit (country, 1) pairs; after the shuffle, reducers sum the values per key, yielding Brazil 4, Germany 4, Chile 3, Argentine 1, Greece 1, Japan 1.] 10
11 Implementation (High Level)

mapper(filename, file-contents):
    for each word in file-contents:
        output(word, 1)

reducer(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    output(word, sum)

11
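The pseudocode above can be turned into a runnable single-process sketch of the whole map-shuffle-reduce flow. This is plain Python standing in for Hadoop, not Hadoop API code; the five documents are the ones from the word-count example slide:

```python
from collections import defaultdict

def mapper(filename, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reducer(word, values):
    # Sum the partial counts for one key.
    yield (word, sum(values))

def run_job(docs):
    # Shuffle: group all intermediate values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for name, contents in docs.items():
        for key, value in mapper(name, contents):
            groups[key].append(value)
    result = {}
    for key in sorted(groups):          # reducers see keys in sorted order
        for k, v in reducer(key, groups[key]):
            result[k] = v
    return result

docs = {
    "Doc1": "Brazil Germany Argentine",
    "Doc2": "Chile Germany Brazil",
    "Doc3": "Greece Japan Brazil",
    "Doc4": "Chile Germany",
    "Doc5": "Chile Brazil Germany",
}
counts = run_job(docs)
```

Running this reproduces the counts on the example slide (Brazil 4, Germany 4, Chile 3, and so on).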
12 MapReduce/Hadoop: 6 Basic Steps Source: C.Doulkeridis and K.Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. In the VLDB Journal, June
13 1. Input Reader Reads data from files and transforms them to key-value pairs. It is possible to support different input sources, such as a database or main memory. The data form splits, which are the unit of data handled by a map task. A typical size of a split is the size of a block (default: 64MB in HDFS, but customizable). 13
14 2. Map Function Takes as input a key-value pair from the Input Reader. Runs the code (logic) of the Map function on the pair. Produces as result a new key-value pair. The results of the Map function are placed in an in-memory buffer and written to disk when the buffer fills up (e.g. to 80%), producing spill files. The spill files are then merged into one sorted file. 14
15 3. Combiner Function Its use is optional. Used when the Map function produces many intermediate pairs with the same key, and the Reduce function is commutative (a + b = b + a) and associative ((a + b) + c = a + (b + c)). In this case, the Combiner function performs partial merging on the map side, so that pairs with the same key will be processed as a group by a Reduce task. [Example: a mapper emitting (Brazil, 1), (Germany, 1), (Argentine, 1), (Chile, 1), (Germany, 1), (Brazil, 1) passes its output through the combiner, which emits (Brazil, 2), (Germany, 2), (Argentine, 1), (Chile, 1) before the shuffle.] 15
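A minimal sketch of a combiner as a map-side partial merge (plain Python, not Hadoop's Combiner API; the intermediate pairs are from the slide's example):

```python
from collections import Counter

def combine(map_output):
    # Combiner: partially merge pairs with the same key on the map side.
    # Valid here because addition is commutative and associative, so
    # pre-summing does not change the reducer's final result.
    partial = Counter()
    for key, value in map_output:
        partial[key] += value
    return list(partial.items())

# Intermediate pairs produced by one mapper.
map_output = [("Brazil", 1), ("Germany", 1), ("Argentine", 1),
              ("Chile", 1), ("Germany", 1), ("Brazil", 1)]
combined = combine(map_output)
```

Fewer pairs cross the network: six intermediate pairs shrink to four, while the total count per key is preserved.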
16 4. Partition Function Default: a hash function is used to partition the results of Map tasks across Reduce tasks. Usually this works well for load balancing. However, it is often useful to employ other partition functions; such a function can be user-defined. 16
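A sketch of the default hash partitioner next to a hypothetical user-defined range partitioner (plain Python; the function names and the boundary values are illustrative, not Hadoop API):

```python
def default_partition(key, num_reducers):
    # Default behaviour: hash the key modulo the number of reduce tasks.
    return hash(key) % num_reducers

def range_partition(key, boundaries):
    # User-defined alternative: range partitioning on the key;
    # the boundary values here are purely illustrative.
    for i, bound in enumerate(boundaries):
        if key <= bound:
            return i
    return len(boundaries)

keys = ["Brazil", "Germany", "Argentine", "Chile", "Greece"]
hash_assignment = {k: default_partition(k, 2) for k in keys}
range_assignment = {k: range_partition(k, ["Cz", "M"]) for k in keys}
```

With a custom partitioner the user controls which reducer sees which keys, which later techniques (e.g. theta-join region assignment) rely on.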
17 5. Reduce Function The Reduce function is called once for each distinct key and is applied to all values associated with that key. All pairs with the same key will be processed as a group. The input to each Reduce task is given in increasing key order. It is possible for the user to define a comparator which will be used for sorting. 17
18 6. Output Writer Responsible for writing the output to secondary storage (disk) Usually, this is a file However, it is possible to modify the function to store data, e.g., in a database 18
19 MapReduce Advantages Scalability Fault-tolerance Flexibility Ease of programming 19
20 10 Weaknesses/Limitations of MapReduce
1. Efficient access to data
2. High communication cost
3. Redundant processing
4. Recomputation
5. Lack of early termination
6. Lack of iteration
7. Fast computation of approximate results
8. Load balancing
9. Real-time processing
10. Lack of support for operators with multiple inputs
+ Systematic classification of existing work until 2012 (over 110 references)
+ Comparison of offered features
+ Summary of techniques used
Source: C.Doulkeridis and K.Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. In the VLDB Journal, June
21 M-way Operations in MapReduce Not naturally supported Other systems support such operations by design (e.g., Storm) How in MapReduce? Two families of techniques (high level classification) Map-side join Reduce-side join 21
22 Map-side Join [Example: relations L(Uid, Time, Action) and R(Uid, Uname, Status), both partitioned and sorted on Uid, are joined on L.uid = R.uid entirely by map tasks: each mapper merges one partition of L with the corresponding partition of R and outputs the joined (Uname, Action) pairs, e.g. (cdoulk, Update), (cdoulk, Read), (jimper0, Read), (jimper0, Write), (spirosoik, Insert).] 22
23 Map-side Join Operates on the map side, no reduce phase. Requirements for each input: divided into the same number of partitions; sorted by the join key; has the same number of keys. Advantages: very efficient (no intermediate data, no shuffling, simply scan and join). Disadvantages: very strict requirements (in practice an extra MR job is necessary to prepare the inputs); high memory requirements (buffers both input partitions). Source: T.White Hadoop: The Definitive Guide 23
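Under the requirements above (both inputs partitioned the same way and sorted on the join key), the per-mapper work reduces to a merge join. A single-partition sketch in plain Python, using the slide's example tuples:

```python
def map_side_join(left, right):
    # Merge-join of one partition of L with the corresponding partition
    # of R. Both lists must already be sorted by the join key (first
    # field), so one coordinated scan suffices: no shuffle, no reducers.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every left tuple of this key block with the right tuple.
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                out.append((lk, left[i2][1], right[j][1]))
                i2 += 1
            j += 1
    return out

# One pair of co-partitioned inputs, sorted on Uid.
left = [(1, "Update"), (1, "Read"), (2, "Read"), (2, "Write"), (3, "Insert")]
right = [(1, "cdoulk"), (2, "jimper0"), (3, "spirosoik")]
joined = map_side_join(left, right)
```

Note how nothing is buffered beyond the current key block, which is why the strict sortedness requirement buys so much efficiency.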
24 Reduce-side Join [Example: mappers read L(Uid, Time, Action) and R(Uid, Uname, Status) and tag each tuple with its source relation, emitting e.g. (1, L:Update), (1, L:Read), (1, R:cdoulk); the shuffle groups all tuples with the same Uid at one reducer, which joins them on L.uid = R.uid and outputs (Uname, Action) pairs such as (cdoulk, Update), (cdoulk, Read), (jimper0, Read), (jimper0, Write), (spirosoik, Insert).] 24
25 Reduce-side Join a.k.a. repartition join. The join computation is performed on the reduce side. Advantages: the most general method (similar to a traditional parallel join); smaller memory footprint (only the tuples of one dataset with the same key need to be buffered). Disadvantages: cost of shuffling, i.e. I/O and communication costs for transferring data to the reducers; high memory requirements for skewed data. Source: T.White Hadoop: The Definitive Guide 25
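A single-process sketch of the repartition join: mappers tag each tuple with its source relation, the shuffle groups by join key, and each reducer crosses the two groups (plain Python, slide data):

```python
from collections import defaultdict

def map_tag(tuples, tag, key_index):
    # Map phase: emit (join key, (tag, tuple)); the tag records the source.
    for t in tuples:
        yield (t[key_index], (tag, t))

def reduce_join(key, tagged_values):
    # Reduce phase: all tuples of both relations sharing this key arrive
    # together; join them by crossing the L-side group with the R-side group.
    l_side = [t for tag, t in tagged_values if tag == "L"]
    r_side = [t for tag, t in tagged_values if tag == "R"]
    for l in l_side:
        for r in r_side:
            yield (key, l, r)

L = [(1, "19:45", "Update"), (2, "21:11", "Read"), (1, "21:12", "Read"),
     (3, "23:09", "Insert"), (2, "23:45", "Write")]
R = [(1, "cdoulk", "Lecturer"), (2, "jimper0", "Postgrad"),
     (3, "spirosoik", "Postgrad")]

# Shuffle: group the tagged tuples by join key.
groups = defaultdict(list)
for k, v in list(map_tag(L, "L", 0)) + list(map_tag(R, "R", 0)):
    groups[k].append(v)

joined = [row for k in groups for row in reduce_join(k, groups[k])]
```

The shuffle step is exactly the cost the slide warns about: every tuple of both relations travels to a reducer before any join work happens.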
26 Join Types and Algorithms 26
27 Example: Equi-Join L(Uid, Time, Action) = {(1, 19:45, Update), (2, 21:11, Read), (1, 21:12, Read), (3, 23:09, Insert), (2, 23:45, Write)}; R(Uid, Uname, Status) = {(1, cdoulk, Lecturer), (2, jimper0, Postgrad), (3, spirosoik, Postgrad)}; join condition: L.uid = R.uid. 27
28 Equi-Joins: Broadcast Join Map-only job. Assumes that R << L. [R is read from the DFS and replicated to every mapper; each mapper joins its split Li of L with the full copy of R.] Blanas et al. SIGMOD 10 28
29 Equi-Joins: Broadcast Join Advantages: efficient (map phase only); load the small dataset R into a hash table for faster probes; no pre-processing (e.g. sorting). Disadvantages: one dataset must be quite small, so that it fits in memory and can be distributed to all mappers. 29
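A sketch of the broadcast join under the assumption that R fits in memory: R is loaded into a hash table once per mapper, and each split of L only probes it (plain Python, slide data):

```python
def broadcast_join(splits_of_L, R):
    # Build side: the small relation R, replicated to every mapper and
    # loaded into an in-memory hash table for fast probes.
    hash_R = {uid: uname for uid, uname, status in R}
    out = []
    for split in splits_of_L:          # each split is handled by one map task
        for uid, time, action in split:
            if uid in hash_R:          # probe; matching tuples are joined
                out.append((uid, action, hash_R[uid]))
    return out

# L arrives as splits, one per map task; R is broadcast whole.
splits = [[(1, "19:45", "Update"), (2, "21:11", "Read")],
          [(1, "21:12", "Read"), (3, "23:09", "Insert"), (2, "23:45", "Write")]]
R = [(1, "cdoulk", "Lecturer"), (2, "jimper0", "Postgrad"),
     (3, "spirosoik", "Postgrad")]
joined = broadcast_join(splits, R)
```

No intermediate data crosses the network in this scheme; the trade-off is that every mapper holds a full copy of R.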
30 Equi-Joins: Semi-join [Three phases: (1) mappers extract the distinct join keys L.uid from the splits of L; (2) R is filtered against these keys, keeping only the tuples of R that join with L; (3) the filtered R is broadcast-joined with each split Li of L.] Blanas et al. SIGMOD 10 30
31 Equi-Joins: Semi-join Advantages: when R is large, reduces the size of intermediate data and the communication costs; useful when many tuples of one dataset do not join with tuples of the other. Disadvantages: three MR jobs, with their overheads (job initialization, communication, local and remote I/O); multiple accesses of the datasets. 31
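The three jobs can be sketched as three phases of one Python function. The non-joining tuple (4, nobody, Alumnus) is added here only to make the filtering effect visible; it is not part of the slide's data:

```python
def semi_join(L, R):
    # Job 1: extract the distinct join keys appearing in L.
    keys_in_L = {t[0] for t in L}
    # Job 2: filter R, keeping only tuples whose key occurs in L.
    r_filtered = [t for t in R if t[0] in keys_in_L]
    # Job 3: broadcast-join the (now small) filtered R with L.
    lookup = {t[0]: t[1] for t in r_filtered}
    return [(uid, action, lookup[uid])
            for uid, time, action in L if uid in lookup]

L = [(1, "19:45", "Update"), (2, "21:11", "Read"), (1, "21:12", "Read"),
     (3, "23:09", "Insert"), (2, "23:45", "Write")]
R = [(1, "cdoulk", "Lecturer"), (2, "jimper0", "Postgrad"),
     (3, "spirosoik", "Postgrad"), (4, "nobody", "Alumnus")]
joined = semi_join(L, R)
```

The tuple with uid 4 never reaches the final phase, which is exactly the saving the semi-join is after when many R-tuples do not join.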
32 Equi-Joins: Per-Split Semi-join [Three phases: (1) extract the distinct join keys Li.uid of each split Li of L; (2) mappers output the records of R tagged with the split RLi they will join with; (3) join each RLi only with its split Li.] Blanas et al. SIGMOD 10 32
33 Equi-Joins: Per-Split Semi-join Advantages (compared to Semi-join): makes the 3rd phase cheaper, as it moves only the records of R that will join with each split of L. Disadvantages (compared to Semi-join): the first two phases are more complicated. 33
34 Other Approaches for Equi-Joins Map-Reduce-Merge [Yang et al. SIGMOD 07] Map-Join-Reduce [Jiang et al. TKDE 11] Llama [Lin et al. SIGMOD 11] Multi-way Equi-Join [Afrati et al. TKDE 11] 34
35 Example: Theta-Join S(Uid, Time, Action) = {(1, 19:45, Update), (2, 21:11, Read), (1, 21:12, Read), (3, 23:09, Insert), (2, 23:45, Write)}; T(Uid, Uname, Status) = {(1, cdoulk, Lecturer), (2, jimper0, Postgrad), (3, spirosoik, Postgrad)}; join condition: S.uid <= T.uid. Challenge: how to map the inequality join condition to a key-equality based computing paradigm. 35
36 Join Matrix Representation M(i,j) = true, if the i-th tuple of S and the j-th tuple of T satisfy the join condition. Problem: find a mapping of join-matrix cells to reducers that minimizes job completion time. Each join output tuple should be produced by exactly one reducer. Okcan et al. SIGMOD 11 36
37 Matrix-to-Reducer Mapping Three strategies: Standard (max-reducer-input = 5, max-reducer-output = 6); Random (max-reducer-input = 8, max-reducer-output = 4); Balanced (max-reducer-input = 5, max-reducer-output = 4). Okcan et al. SIGMOD 11 37
38 MapReduce Algorithm: 1-Bucket-Theta Given a matrix-to-reducer mapping: For each incoming S-tuple x ∈ S, assign a random matrix row from 1..|S|; for all regions that correspond to this row, output (regionID, (x, "S")). For each incoming T-tuple y ∈ T, assign a random matrix column from 1..|T|; for all regions that correspond to this column, output (regionID, (y, "T")). Okcan et al. SIGMOD 11 38
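A single-process sketch of 1-Bucket-Theta: the join matrix is tiled into rectangular regions (one per reducer), S-tuples are replicated along a random row and T-tuples along a random column, so every (x, y) pair meets in exactly one region. The tiling and the <= predicate below are illustrative choices, not the paper's optimized partitioning:

```python
import random

def one_bucket_theta(S, T, regions, theta):
    # regions: region_id -> (row_lo, row_hi, col_lo, col_hi), a disjoint
    # tiling of the |S| x |T| join matrix; one region per reducer.
    buckets = {rid: ([], []) for rid in regions}
    # Map phase for S: pick a random row, send x to all regions covering it.
    for x in S:
        row = random.randint(1, len(S))
        for rid, (r1, r2, c1, c2) in regions.items():
            if r1 <= row <= r2:
                buckets[rid][0].append(x)
    # Map phase for T: pick a random column, send y to all regions covering it.
    for y in T:
        col = random.randint(1, len(T))
        for rid, (r1, r2, c1, c2) in regions.items():
            if c1 <= col <= c2:
                buckets[rid][1].append(y)
    # Reduce phase: each region evaluates the theta predicate locally.
    out = set()
    for s_part, t_part in buckets.values():
        for x in s_part:
            for y in t_part:
                if theta(x, y):
                    out.add((x, y))
    return out

S, T = [1, 2, 3, 4], [1, 2, 3, 4]
regions = {0: (1, 2, 1, 4), 1: (3, 4, 1, 4)}   # two horizontal stripes
result = one_bucket_theta(S, T, regions, lambda x, y: x <= y)
```

Because the regions tile the matrix, the output is complete and duplicate-free regardless of the random row/column choices; only the load per reducer varies.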
39 In the paper How to perform near-optimal partitioning with strong optimality guarantees For cross-product SxT More efficient algorithms (M-Bucket-I, M-Bucket- O) when more statistics are available Okcan et al. SIGMOD 11 39
40 Other Approaches for Theta-Joins Multi-way Theta-joins [Zhang et al. VLDB 12] Binary Theta-joins [Koumarelas et al. EDBTW 14] 40
41 Example: (Set) Similarity Join [Relations L(Rid, a, ...) and R(Rid, a, ...) whose attribute a holds token sets, e.g. {A,B,C}, {D,E,F}, {C,F}, ...; join condition: sim(L.a, R.a) > τ.] Example (strings): S1: "I will call back" → [I, will, call, back]; S2: "I will call you soon" → [I, will, call, you, soon]; e.g. sim(): Jaccard, τ = 0.4; sim(S1, S2) = 3/6 = 0.5 > 0.4 41
42 Background: Set Similarity Joins Rely on filters for efficiency, such as prefix filtering. Global token ordering: sort tokens based on increasing frequency order. Jaccard(s1,s2) >= τ implies Overlap(s1,s2) >= a, where a = τ / (1 + τ) * (|s1| + |s2|). Similar records need to share at least one common token in their prefixes. Example: s1 = {A,B,C,D,E}. Prefix length of s1: |s1| - ceil(τ * |s1|) + 1. For τ=0.9: 5 - ceil(0.9*5) + 1 = 1, prefix of s1 is {A}. For τ=0.8: 5 - ceil(0.8*5) + 1 = 2, prefix of s1 is {A,B}. For τ=0.6: 5 - ceil(0.6*5) + 1 = 3, prefix of s1 is {A,B,C}. 42
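The prefix-length and overlap formulas above, sketched in Python (tokens are assumed to be already sorted by the global token ordering):

```python
import math

def prefix_length(record_size, tau):
    # Number of leading tokens (under the global ordering) that must be
    # considered so that any pair reaching Jaccard >= tau is guaranteed
    # to share at least one prefix token.
    return record_size - math.ceil(tau * record_size) + 1

def overlap_threshold(s1_size, s2_size, tau):
    # Jaccard(s1, s2) >= tau implies Overlap(s1, s2) >= this bound a.
    return tau / (1 + tau) * (s1_size + s2_size)

def prefix(sorted_tokens, tau):
    # The filtering prefix of a record whose tokens are globally sorted.
    return sorted_tokens[:prefix_length(len(sorted_tokens), tau)]

s1 = ["A", "B", "C", "D", "E"]
```

Evaluating `prefix(s1, tau)` for τ = 0.9, 0.8, 0.6 reproduces the three prefixes {A}, {A,B}, {A,B,C} from the slide.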
43 Set Similarity Joins Solves the problem of Set Similarity Self-Join in 3 stages. Can be implemented as 3 (or more) MR jobs: #1: Token Ordering; #2: RID-pair Generation; #3: Record Join. Vernica et al. SIGMOD 10 43
44 Set Similarity Joins Stage 1: Token Ordering Vernica et al. SIGMOD 10 44
45 Set Similarity Joins Stage 2: RID-pair Generation Vernica et al. SIGMOD 10 45
46 Set Similarity Joins Stage 3: Record Join Vernica et al. SIGMOD 10 46
47 Other Approaches for Similarity Joins V-SMART-Join [Metwally et al. VLDB 12] SSJ-2R [Baraglia et al. ICDM 10] Silva et al. [Cloud-I 12] Top-k similarity join [Kim et al. ICDE 12] Fuzzy join [Afrati et al. ICDE 12] 47
48 Example: Top-k Join L(Supplier, Product, Discount) = {(S1, Monitor, 100), (S2, CPU, 160), (S3, CPU, 120), (S4, HDD, 190), (S5, HDD, 90), (S6, DVD-RW, 200)}; R(Customer, Product, Price) = {(C1, Monitor, 800), (C2, CPU, 240), (C3, HDD, 500), (C4, CPU, 550), (C5, HDD, 700), (C6, Monitor, 600)}; query: Topk(L ⋈ L.product=R.product R). 48
49 RanKloud Basic ideas: generate statistics at runtime (adaptive sampling); calculate a threshold that identifies candidate tuples for inclusion in the top-k set; repartition data to achieve load balancing. Cannot guarantee completeness: may produce fewer than k results. Probably requires 2 MapReduce jobs. Candan et al. IEEE Multimedia
50 Top-k Joins: Our Approach Basic features: exploit histograms built seamlessly during data upload to HDFS; compute bounds and access only part of the input data (Map phase); perform the top-k join in a load-balanced way (Reduce phase). Working paper. Doulkeridis et al. ICDE 12, CloudI 12@VLDB 50
51 Other Approaches for Top-k Joins BFHM Rank-Join in HBase [Ntarmos et al. VLDB 14] Tomorrow, HDMS session 5, 10:50-12:30 51
52 Example: knn-join R Rid X Y S Sid X Y knnj(r,s)={(r, knn(r,s)) for all r R} 52
53 knn Joins in MapReduce Straightforward solution: block nested loop join (H-BNLJ). 1st MapReduce job: partition R and S into n equi-sized blocks each; every pair of blocks is assigned to a bucket (n^2 buckets); invoke r = n^2 reducers to do the knn join (n*k candidates for each record). 2nd MapReduce job: partition the output records of the 1st phase by record ID; in the reduce phase, sort in ascending order of distance; then report the top-k results for each record. Improvement (H-BRJ): build an index for the local S block in the reducer, to find the knn efficiently. Zhang et al. EDBT 12 53
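A single-process sketch of H-BNLJ's two jobs (plain Python; `math.dist` needs Python 3.8+, and the round-robin blocking below is just one illustrative way to form equi-sized blocks):

```python
import math

def local_knn(r, s_block, k):
    # One reducer's work: the k nearest S-points to r within one S-block.
    return sorted(s_block, key=lambda s: math.dist(r, s))[:k]

def h_bnlj(R, S, n, k):
    # 1st job: split R and S into n blocks each; every (R-block, S-block)
    # pair is handled by one of the n^2 reducers, each contributing k
    # local candidates per R-record (n*k candidates per record overall).
    r_blocks = [R[i::n] for i in range(n)]
    s_blocks = [S[i::n] for i in range(n)]
    candidates = {r: [] for r in R}
    for rb in r_blocks:
        for sb in s_blocks:
            for r in rb:
                candidates[r].extend(local_knn(r, sb, k))
    # 2nd job: group the candidates by record ID, sort by distance,
    # and report the top-k results for each record.
    return {r: sorted(cands, key=lambda s: math.dist(r, s))[:k]
            for r, cands in candidates.items()}

R = [(0, 0), (5, 5)]
S = [(0, 1), (0, 2), (6, 5), (10, 10)]
result = h_bnlj(R, S, n=2, k=1)
```

The quadratic fan-out (n^2 block pairs, n*k candidates per record) is what makes the exact solution costly and motivates the approximate H-zkNNJ that follows.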
54 H-zkNNJ The exact solution is too costly! Go for an approximate solution instead. Solves the problem in 3 phases: #1: construction of shifted copies of R and S; #2: compute candidate k-nn for each record r ∈ R; #3: determine the actual k-nn for each r. Zhang et al. EDBT 12 54
55 H-zkNNJ: Phase 1 [The shifted copies Ri, Si are stored locally; estimated quantiles A[0]..A[n] are used for partitioning Ri, Si.] Zhang et al. EDBT 12 55
56 H-zkNNJ: Phase 2 [Candidate k-nn computation, using α*n reducers.] Zhang et al. EDBT 12 56
57 H-zkNNJ: Phase 3 Output of Phase 2: for each record r ∈ R, a candidate set C(r). Computing knn(r, C(r)) is then trivial. Zhang et al. EDBT 12 57
58 Other Approaches for knn Joins [Lu et al. VLDB 12] Partitioning based on Voronoi diagrams 58
59 Important Phases of Join Algorithms How to boost the speed of my join algorithm? 59
60 Pre-processing Includes techniques such as: Ordering [Vernica et al. SIGMOD 10]: global token ordering. Sampling [Kim et al. ICDE 12]: upper bound on the distance of the k-th closest pair. Frequency computation [Metwally et al. VLDB 12]: partial results and cardinalities of items. Columnar storage, sorted [Lin et al. SIGMOD 11]: enables map-side join processing. 60
61 Pre-filtering Reduces the candidate input tuples to be processed. Essential pairs [Kim et al. ICDE 12]: partitioning only those pairs necessary for producing the top-k most similar pairs. 61
62 Partitioning More advanced partitioning schemes than hashing (application-specific): Theta-joins [Okcan et al. SIGMOD 11]: map the join matrix to the reduce tasks. knn joins [Lu et al. VLDB 12]: Z-ordering for 1-dimensional mapping; range partitioning. Top-k joins: partitioning schemes tailored to top-k queries; utility-aware partitioning [RanKloud]; angle-based partitioning [Doulkeridis et al. CloudI 12]. 62
63 Replication/Duplication In certain parallel join algorithms, replication of records is unavoidable. How can we control the amount of replication? Set-similarity join [Vernica et al. SIGMOD 10]: grouped tokens. Multi-way theta-joins [Zhang et al. VLDB 12]: minimize duplication of records to partitions; use a multidimensional space partitioning method to assign partitions of the join space to reducers. 63
64 Load Balancing Data skew causes unbalanced allocation of work to reducers (there is no built-in data-aware load balancing mechanism). Frequency-based assignment of keys to reducers [Vernica et al. SIGMOD 10]. Cardinality-based assignment [Metwally et al. VLDB 12]. Minimize job completion time [Okcan et al. SIGMOD 11]. Approximate quantiles [Lu et al. VLDB 12]: partition the input relations into equi-sized partitions. 64
65 Analysis of Join Processing Source: C.Doulkeridis and K.Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. In the VLDB Journal, June
66 Summary and Open Research Directions 66
67 Summary Processing joins over Big Data in MapReduce is a challenging task, especially for complex join types, large-scale data, skewed datasets, ... However, focus on the techniques(!), not the platform. New techniques are required, and old techniques (parallel joins) need to be revisited to accommodate the needs of new platforms (such as MapReduce). Window of opportunity to conduct research in an interesting topic 67
68 Trends Scalable and efficient platforms MapReduce improvements, new platforms (lots of research projects worldwide), etc. Declarative querying instead of programming language Hive, Pig (over MapReduce), Trident (over Storm) Real-time and interactive processing Data visualization Main-memory databases 68
69 References
[Blanas et al. SIGMOD 10] A comparison of join algorithms for log processing in MapReduce.
[Yang et al. SIGMOD 07] Map-reduce-merge: simplified relational data processing on large clusters.
[Jiang et al. TKDE 11] MAP-JOIN-REDUCE: Toward scalable and efficient data analysis on large clusters.
[Lin et al. SIGMOD 11] Llama: leveraging columnar storage for scalable join processing in the MapReduce framework.
[Okcan et al. SIGMOD 11] Processing Theta-Joins using MapReduce.
[Zhang et al. VLDB 12] Efficient Multi-way Theta-Join Processing using MapReduce.
[Koumarelas et al. EDBTW 14] Binary Theta-Joins using MapReduce: Efficiency Analysis and Improvements.
[Candan et al. IEEE Multimedia 2011] RanKloud: Scalable Multimedia Data Processing in Server Clusters.
[Doulkeridis et al. CloudI 12] On Saying "Enough Already!" in MapReduce.
[Ntarmos et al. VLDB 14] Rank Join Queries in NoSQL Databases.
[Zhang et al. EDBT 12] Efficient Parallel kNN Joins for Large Data in MapReduce.
[Lu et al. VLDB 12] Efficient Processing of k Nearest Neighbor Joins using MapReduce.
69
70 Thank you for your attention More info: Contact:
How To Write A Paper On Bloom Join On A Distributed Database
Research Paper BLOOM JOIN FINE-TUNES DISTRIBUTED QUERY IN HADOOP ENVIRONMENT Dr. Sunita M. Mahajan 1 and Ms. Vaishali P. Jadhav 2 Address for Correspondence 1 Principal, Mumbai Education Trust, Bandra,
SQL Query Evaluation. Winter 2006-2007 Lecture 23
SQL Query Evaluation Winter 2006-2007 Lecture 23 SQL Query Processing Databases go through three steps: Parse SQL into an execution plan Optimize the execution plan Evaluate the optimized plan Execution
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi
International Conference on Applied Science and Engineering Innovation (ASEI 2015) An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi Institute of Computer Forensics,
Chapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University
High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University Introduction Spatial Big Data Geo-crowdsourcing:OpenStreetMap Remote
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
MAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
Vertica Live Aggregate Projections
Vertica Live Aggregate Projections Modern Materialized Views for Big Data Nga Tran - HPE Vertica - [email protected] November 2015 Outline What is Big Data? How Vertica provides Big Data Solutions? What
Comparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,[email protected]
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,[email protected] Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)
Comp 5311 Database Management Systems 16. Review 2 (Physical Level) 1 Main Topics Indexing Join Algorithms Query Processing and Optimization Transactions and Concurrency Control 2 Indexing Used for faster
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore [email protected] Abstract. Processing XML queries over
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Sector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

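The Map and Reduce signatures above — Map (k1, v1) → List (k2, v2), Reduce (k2, List(v2)) → List (k3, v3), with a Shuffle in between — can be illustrated with a minimal word-count sketch. This is not Hadoop code; it simulates the shuffle in-process with a dictionary, and the function names (`map_fn`, `reduce_fn`, `run_job`) are hypothetical:

```python
from collections import defaultdict

# Map: (k1, v1) -> List(k2, v2) -- emit (word, 1) for each word
def map_fn(doc_id, text):
    return [(word, 1) for word in text.split()]

# Reduce: (k2, List(v2)) -> List(k3, v3) -- sum the counts per word
def reduce_fn(word, counts):
    return [(word, sum(counts))]

def run_job(records):
    # Map phase over all input key-value pairs
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle: partition and sort-merge intermediate pairs by key k2
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase, one call per distinct key
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

print(run_job([(1, "big data big joins")]))
# [('big', 2), ('data', 1), ('joins', 1)]
```

In a real Hadoop job, the Map and Reduce functions run on different machines, and the Shuffle moves and sort-merges the intermediate pairs over the network; the in-memory `groups` dictionary here stands in for that step.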