Big Data and Scripting map/reduce in Hadoop



1, Big Data and Scripting - map/reduce in Hadoop

2, parts of a Hadoop map/reduce implementation
- the core framework provides customization via individual map and reduce functions (e.g. the implementation in MongoDB)
- Hadoop provides customization at all steps: input reading, map, combine, partition, shuffle and sort, reduce, output
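The interplay of these steps can be illustrated with a minimal Python simulation of the pipeline for word count; all names (map_fn, combine_fn, partition_fn, reduce_fn) are illustrative, not Hadoop API:

```python
from collections import defaultdict

def map_fn(offset, line):
    # key: positional information, value: line of the file
    for word in line.split():
        yield word, 1

def combine_fn(key, values):          # local pre-aggregation per mapper
    yield key, sum(values)

def partition_fn(key, num_reducers):  # default: hash partitioning
    return hash(key) % num_reducers

def reduce_fn(key, values):
    yield key, sum(values)

def run_job(lines, num_reducers=2):
    # map + combine per "node" (here simplified: per input line)
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for offset, line in enumerate(lines):
        local = defaultdict(list)
        for k, v in map_fn(offset, line):
            local[k].append(v)
        for k, vs in local.items():
            for k2, v2 in combine_fn(k, vs):
                partitions[partition_fn(k2, num_reducers)][k2].append(v2)
    # shuffle/sort + reduce, per partition
    result = {}
    for part in partitions:
        for k in sorted(part):        # shuffle/sort orders keys
            for k2, v2 in reduce_fn(k, part[k]):
                result[k2] = v2
    return result
```

Each of the functions above corresponds to one customization point that the following slides discuss in detail.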

3, input reading
- split up input file(s) into records of key and value
  - key: usually positional information
  - value: line/part of the file
- keys have no influence on the distribution to mappers
- individual input readers implement InputFormat
  - use to split up e.g. XML files into records
- input readers should split, but not parse the input
- input readers are run on individual blocks
  - executed on the nodes that store the corresponding blocks
  - records that could overlap blocks have to be handled manually

4, combine
- combine pairs emitted on a node before distribution
  - the mapper outputs on each node are fed into the combiner
  - the output of the combiner is distributed to the reducers
- combiners compress output on a semantic level
  - example: word count, reduce 3 x (word,1) to (word,3)
  - the framework implements compression on the data level
- combiners implement the Reducer interface (same input/output formats)
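The semantic compression can be sketched in a few lines of Python; the combiner has the same signature as a reduce function, mirroring the fact that Hadoop combiners implement the Reducer interface:

```python
from collections import defaultdict

def combiner(key, values):
    # identical role and signature as the reduce function
    return key, sum(values)

# pairs emitted by one mapper on one node (illustrative data)
mapper_output = [("word", 1), ("word", 1), ("word", 1), ("data", 1)]

grouped = defaultdict(list)
for k, v in mapper_output:
    grouped[k].append(v)

combined = [combiner(k, vs) for k, vs in grouped.items()]
# 4 pairs leave the map function, but only 2 pairs cross the network
```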

5, partition
- distribute key/value pairs to nodes: which keys go to which nodes?
- the output of the partitioner is written to HDFS, one file per partition & node
- nodes that run reducers pull (download) the files corresponding to their partition
- default: hash function for a random uniform distribution
- a partitioner can influence the distribution directly
  - e.g. ensure that certain pairs end up on the same node
  - the output for these will also end up on the same node
- interface: Partitioner maps key/value to a partition (int)
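A minimal sketch of the Partitioner contract (key/value to partition number); the user/session key layout is a made-up example of forcing related keys onto the same node:

```python
def default_partition(key, value, num_partitions):
    # default behaviour: hash the key for uniform distribution
    return hash(key) % num_partitions

def custom_partition(key, value, num_partitions):
    # illustrative: route all sessions of one user to the same
    # reducer by hashing only the user part of "user#session" keys
    user = key.split("#")[0]
    return hash(user) % num_partitions

p1 = custom_partition("user42#session1", None, 4)
p2 = custom_partition("user42#session2", None, 4)
# both pairs end up in the same partition, hence on the same node
```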

6, influencing the shuffle/sort
- keys are distributed by the partitioner to the reducer nodes
- reducers download the partition files from the mapper nodes
- each reducer node starts a shuffle/sort process
  - sorts key/value pairs by key
- only parameter: a custom Comparator for keys
  - influences order and equality
  - implements an ordering within equivalence classes of keys
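How a custom comparator affects both order and equality can be simulated in Python; here keys differing only in case form one equivalence class, so they would land in the same reduce group (an illustrative choice, not a Hadoop default):

```python
import functools

def compare(a, b):
    # case-insensitive comparison; returning 0 makes keys "equal"
    a2, b2 = a.lower(), b.lower()
    return (a2 > b2) - (a2 < b2)

keys = ["Banana", "apple", "BANANA", "Cherry"]
ordered = sorted(keys, key=functools.cmp_to_key(compare))
# "Banana" and "BANANA" compare equal -> same equivalence class
```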

7, output
- analogous to input reading
- write the local output of the reducers to disk/HDFS

8, generic information distribution
- transport information in addition to the key/value pairs
- examples:
  - small-scale parameters, e.g. the number of clusters for k-means
  - large-scale additional information, e.g. look-up tables
- use JobConf to distribute algorithm parameters
  - set generic parameters using e.g. set(String name, String value)
  - use the configure() implementation of mapper/reducer/partitioner/combiner to retrieve them before execution
  - each class is provided with the configuration
  - initialize class variables from the general job configuration

9, map/reduce design patterns
- book: MapReduce Design Patterns (Miner & Shook, O'Reilly, 2012)
- design patterns describe mechanisms common to many algorithms
  - solve similar problems in a number of contexts
  - in a sense: generic algorithms

10, summarization
- general intention: group records by a common field, then aggregate all records with an identical value in that field
- examples: word count, mean, min, max, counting within groups
- similar to GROUP BY in SQL, but with individual aggregation
- result: a table with one entry (row) per group
  - each row contains the group key and the aggregated value(s)

11, summarization
- implementation:
  - map: map each input element to its group (using the group as key), remove values not needed for aggregation
  - combine: partial aggregation, if possible
  - reduce: aggregate the values and return group key/(aggregated values)
- a combiner can drastically reduce network traffic
- a custom partitioner can help to resolve a skewed group-size distribution
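A compact simulation of the summarization pattern, aggregating min, max, and count per group (the month-keyed records are illustrative data):

```python
from collections import defaultdict

records = [("2024-01", 5), ("2024-01", 9), ("2024-02", 3), ("2024-01", 7)]

def mapper(record):
    group, value = record          # the group field becomes the key
    return group, value

def reducer(group, values):
    # individual aggregation, as GROUP BY with min/max/count would do
    return group, (min(values), max(values), len(values))

shuffled = defaultdict(list)       # stands in for the shuffle step
for rec in records:
    k, v = mapper(rec)
    shuffled[k].append(v)

summary = dict(reducer(k, vs) for k, vs in shuffled.items())
```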

12, counting
- a reduce step is not necessary: a limited number of counters can be implemented using only mappers and the Reporter class
- pattern: mappers count occurrences and store the counts in the Reporter; they produce no output pairs, so there is no shuffle or reduction
- functions:
  - Reporter.incrCounter() creates/increases a counter
  - Reporter.getCounter() reads its contents
- counters are aggregated centrally by the framework
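A map-only counting sketch; a plain Counter stands in for Hadoop's Reporter counters, which the framework would aggregate centrally across nodes:

```python
from collections import Counter

counters = Counter()                   # central counter store (simulated)

def incr_counter(name, amount=1):      # analogous to Reporter.incrCounter
    counters[name] += amount

def mapper(offset, line):
    # count matching records; emit no output pairs -> no shuffle/reduce
    for word in line.split():
        if word == "ERROR":
            incr_counter("errors")

for i, line in enumerate(["ERROR disk full", "ok", "ERROR timeout"]):
    mapper(i, line)
```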

13, filtering
- filter input records by some property; limit execution to those that pass
- examples: distributed grep, thresholding, data cleansing, random sampling
- as with counting, no reduce is necessary
  - data is read and written from/to the local node
- an additional reduce can be used to write the filter result to a single file
  - many small files slow down computations
  - mapping all filtered results to a single reducer compacts the result
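The distributed grep variant of the pattern, as a map-only sketch (the log lines and pattern are illustrative):

```python
import re

def grep_mapper(offset, line, pattern):
    # emit only records matching the pattern; no reduce required
    if re.search(pattern, line):
        yield offset, line

lines = ["GET /index.html 200", "GET /404 404", "POST /login 200"]
matches = [kv for i, l in enumerate(lines)
           for kv in grep_mapper(i, l, r" 404$")]
```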

14, distinct
- intention: return only distinct values from a larger set
  - filter duplicates, or values that are highly similar to each other
- implementation:
  - for each record, extract the field(s) considered in the similarity
  - use those as the key, null as the value
  - the shuffle transports identical values to one reducer
  - the reducer only stores one copy
- a combiner is extremely useful: a single mapper produces many duplicate values
- use many reducers: the reducer does not produce high load, and every distinct value needs another reducer invocation
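The distinct pattern in miniature: the considered field is the key, the value is None, and each reduce call keeps a single copy (illustrative data):

```python
from collections import defaultdict

records = ["alice", "bob", "alice", "carol", "bob"]

def mapper(record):
    return record, None            # key = field used for distinctness

shuffled = defaultdict(list)       # shuffle groups identical keys
for rec in records:
    k, v = mapper(rec)
    shuffled[k].append(v)

def reducer(key, values):
    return key                     # keep exactly one copy per key

distinct = sorted(reducer(k, vs) for k, vs in shuffled.items())
```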

15, structured to hierarchical
- intention: convert a flat format (e.g. an SQL table with value repetitions) into a structured format
- example: a table storing values for a foreign key
  - each row contains a set of values and the key; keys are repeated
  - structure by storing key -> set of values
- implementation:
  - map rows to the foreign key (the higher element in the desired structure)
  - collect all values for each key and store them as structured records
- can be repeated bottom-up for several layers of structure
- used to convert flat data into structured records such as JSON/XML
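A sketch of one such conversion: flat (foreign_key, value) rows with repeated keys become one structured, JSON-like record per key (the posts/comments schema is a made-up example):

```python
from collections import defaultdict

rows = [("post1", "comment A"), ("post2", "comment B"),
        ("post1", "comment C")]

def mapper(row):
    fk, value = row
    return fk, value               # map each row to its foreign key

shuffled = defaultdict(list)       # shuffle collects values per key
for row in rows:
    k, v = mapper(row)
    shuffled[k].append(v)

def reducer(fk, values):
    # emit one structured record per foreign key
    return {"post": fk, "comments": values}

structured = {r["post"]: r for r in
              (reducer(k, vs) for k, vs in shuffled.items())}
```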

16, partitioning
- intention: partition data by some property (e.g. date)
  - create smaller groups that can be analyzed individually
- example: group log entries by date to allow analysis on a monthly basis
- implementation:
  - use the identity mapper with the partition property as key
  - implement a custom partitioner, creating partitions by key
  - the reducer only writes the values to the output
- all data is written to one logical file
  - blocks are distributed as proposed by the partitioner
  - e.g. all blocks for a particular month are on one DataNode
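The month-wise partitioning can be sketched as follows; mapping one month per partition is an illustrative assumption:

```python
months = ["2024-01", "2024-02", "2024-03"]

def partitioner(key, num_partitions):
    # one partition per month (assumes num_partitions == len(months))
    return months.index(key[:7])

logs = [("2024-01-15", "a"), ("2024-02-03", "b"), ("2024-01-20", "c")]

# identity mapper: the date field is already the key
partitions = [[] for _ in months]
for date, entry in logs:
    partitions[partitioner(date, len(months))].append(entry)
# each partition now holds exactly one month's entries
```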

17, sorting with the shuffle step
- an implementation of a sort algorithm can exploit the shuffle/sort step
- use the idea presented before: an analyze phase determines the sorting buckets, an order phase sorts
- analyze (first m/r step):
  - sample the input and map to sort keys, without values
  - set the number of reducers to one
  - shuffling sorts the keys; the single reducer gets the sorted list of keys
  - create slices of equal size

18, sorting - order phase
- the mapper maps to sort keys, this time with the values attached
- a custom partitioner applies the partitioning from the previous step to the input values
- one reducer for each partition
  - reducers receive the data ordered by key and only write the incoming data to file
- output is written to part-r-* files (with a number instead of *)
  - the file ordering corresponds to the ordering of the values
  - the values within the files are sorted
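Both phases of this total-order sort can be simulated compactly; the sampling rate and the number of reducers are arbitrary illustrative choices:

```python
import bisect
import random

data = list(range(100))
random.Random(0).shuffle(data)       # unsorted input, fixed seed

# analyze phase: sample the keys; the single reducer derives
# bucket boundaries that slice the key space into equal parts
sample = sorted(data[::10])
num_reducers = 4
step = len(sample) // num_reducers
boundaries = [sample[i * step] for i in range(1, num_reducers)]

# order phase: the custom partitioner applies the precomputed
# boundaries, so partition i holds a contiguous key range
partitions = [[] for _ in range(num_reducers)]
for key in data:
    partitions[bisect.bisect_right(boundaries, key)].append(key)

# each "reducer" sorts its range; concatenating the part files
# in partition order yields the totally ordered output
output = [k for part in partitions for k in sorted(part)]
```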

19, shuffle
- intention: destroy the order of the data, i.e. create a random order/random partitions
- motivation/applications:
  - anonymizing
  - repeatable random sampling
  - distribution of highly accessed parts to multiple nodes
- pattern:
  - map values to a random key
  - shuffling implements the random distribution
  - the reducer only writes, producing random partitions in random order
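A small sketch of the pattern; the fixed seed is what makes the random sampling repeatable, as mentioned above:

```python
import random

def shuffle_records(records, num_partitions, seed=0):
    # map each value to a random partition (a random key, in effect);
    # the seed makes the "random" distribution repeatable
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_partitions)]
    for value in records:
        partitions[rng.randrange(num_partitions)].append(value)
    return partitions

parts = shuffle_records(list(range(10)), 3)
# every record lands in exactly one randomly chosen partition
```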

20, joins
- join as in SQL JOIN: join two tables by a common key
- implementation:
  - map the values to the join key
  - use a partitioner for an even distribution to the reducers
  - the reducer collects the values in temporary lists, using external storage if necessary, one list for each source table
  - in the final step, the reducer produces the output pairs from the lists
- can be used to implement all types of joins: inner, outer, left, right, anti
- all elements with a common key are processed on a single reducer
  - can be problematic when many values have the same key
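A reduce-side inner join in miniature: values are tagged with their source table, the reducer builds one list per table and emits the cross product per key (users/orders are illustrative tables):

```python
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]

# shuffle stage: group by join key, one tagged list per source table
shuffled = defaultdict(lambda: {"users": [], "orders": []})
for k, v in users:
    shuffled[k]["users"].append(v)
for k, v in orders:
    shuffled[k]["orders"].append(v)

def reducer(key, lists):
    # inner join: pair every user with every order for this key
    return [(key, u, o) for u in lists["users"] for o in lists["orders"]]

joined = sorted(t for k, l in shuffled.items() for t in reducer(k, l))
```

Outer, left, right, or anti joins differ only in how the reducer treats empty per-table lists.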

21, replicated join
- join a very large data set with many small data sets
  - the large data set is the left side; output elements correspond to elements of the large data set
  - the small data sets fit into memory
- implementation:
  - the mapper reads the smaller data sets during initialization
  - it processes each record by joining it with the elements from the small data sets
- no combine/shuffle/reduce: the joined data is written directly after mapping
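A map-side sketch of the replicated join; the lookup table passed at initialization stands in for the replicated small data set:

```python
small_table = {1: "alice", 2: "bob"}           # fits into memory

class JoinMapper:
    def __init__(self, lookup):
        # the small data set is read during mapper initialization
        self.lookup = lookup

    def map(self, record):
        key, value = record
        # join each large-set record directly; emitting only matches
        # gives an inner join, emitting misses too would give a left join
        if key in self.lookup:
            yield key, (self.lookup[key], value)

large = [(1, "book"), (3, "lamp"), (2, "pen")]
mapper = JoinMapper(small_table)
joined = [kv for rec in large for kv in mapper.map(rec)]
# no shuffle or reduce: output is written directly after mapping
```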

22, cartesian product
- intention: create all pairs of input values from data sets A and B
- application: pairwise analysis of elements (e.g. determine a distance matrix)
- pattern:
  - create partitions of both data sets: A_1, ..., A_n (n parts) and B_1, ..., B_m (m parts)
  - replicate the values of each partition: A_i as A_i^1, ..., A_i^m, and analogously for the B_j
  - use n x m reducers; each reducer receives all values from one pair of parts A_i, B_j
  - the reducer produces all possible pairings
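The pattern can be verified with a small simulation: splitting both sets, routing every pair of parts to its own "reducer", and checking that the union of all pairings is the full cartesian product (data and part counts are illustrative):

```python
from itertools import product

A = ["a1", "a2", "a3", "a4"]
B = ["b1", "b2"]

def split(data, parts):
    # round-robin split into the requested number of parts
    return [data[i::parts] for i in range(parts)]

n, m = 2, 2
A_parts, B_parts = split(A, n), split(B, m)

# n*m reducers: reducer (i, j) receives all of A_i and all of B_j
pairs = set()
for i, j in product(range(n), range(m)):
    for pair in product(A_parts[i], B_parts[j]):
        pairs.add(pair)
# the union over all reducers is exactly the cartesian product A x B
```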