Introduction to DISC and Hadoop
Alice E. Fischer
April 24, 2009
Outline
1. History
2. Hadoop provides a three-layer paradigm
Parallel Computing: Past and Present
Data-Intensive Scalable Computing (DISC)
Uses parallel computation to process masses of data quickly.
Scales easily to larger or smaller data sets.
Is robust and can survive failure of one or more nodes during a computation.
Hadoop
Is a framework for implementing DISC.
Program segments may be written in Java or Python.
Runs on cluster computers (one master node, several or many slaves).
Is based on a distributed and replicated file system.
Automatically manages the hardware facility, which frequently expands or contracts.
Allows use of cloud computing, that is, hardware that is non-local and not owned.
A minimal job-driver sketch follows.
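In the Java API, a job is usually assembled by a short driver class that names the mapper and reducer and the input and output paths. This is a minimal sketch, assuming a recent Hadoop release and the WordCountMapper and WordCountReducer classes sketched later under Phases 1 and 3; it is illustrative, not the framework's only entry point.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver: configure a map-reduce job and submit it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // mapper sketched under Phase 1
        job.setCombinerClass(WordCountReducer.class); // optional local aggregation
        job.setReducerClass(WordCountReducer.class);  // reducer sketched under Phase 3
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // run and wait for the result
    }
}
```

The driver and its mapper and reducer classes would be packaged into a jar and submitted with the usual `hadoop jar` command.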
Programmer-directed Parallelism
A language can provide primitives for creating and managing parallelism:
Fine-grained, programmer-directed: parbegin and parend (ALGOL 68).
Functional languages with lazy evaluation avoid specifying an order of evaluation.
Coarse-grained parallelism: fork and join (UNIX).
Java threads and synchronization primitives.
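As a concrete illustration of the coarse-grained, fork/join style, here is a tiny Java sketch; the task bodies are placeholders.

```java
// Coarse-grained parallelism with Java threads: "fork" two tasks, then "join" them.
public class ForkJoinSketch {
    public static void main(String[] args) throws InterruptedException {
        Thread taskA = new Thread(() -> System.out.println("task A running"));
        Thread taskB = new Thread(() -> System.out.println("task B running"));
        taskA.start();   // fork: both tasks now run concurrently with main
        taskB.start();
        taskA.join();    // join: wait until each task has finished
        taskB.join();
        System.out.println("both tasks done");
    }
}
```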
Hardware and Compiler-based Parallelism
Computer systems often provide parallel hardware:
Floating-point co-processors: the 8087 for the 8086 chip, 1980s.
Pipelined architectures: 1980s.
Dual-core processors: Pentium Extreme Edition, 2005.
Quad-core: AMD Phenom, 2008.
The extent to which these can be exploited depends on the cleverness of compiler designers.
Parallelism Based on Memory Architecture
A hypercube has a large number of memory nodes cross-connected so that each node is adjacent to others in 2, 3, or 4 dimensions.
Communication between two adjacent nodes is fast; between distant nodes it is slower.
One process controls the computation and the flow of information among the many nodes.
Parallelism Based on Cluster Computing
The Linda system, based on David Gelernter's tuple spaces, has been implemented on multiple computers sharing a local network.
A tuple space is an implementation of associative memory for parallel/distributed computing. It provides a repository of tuples that can be accessed concurrently.
For example, consider a group of processes that produce data and post it as tuples in the space, and a group of processes that retrieve and use tuples from the space that match a certain pattern. A toy sketch of this idea appears below.
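This is a toy, in-memory sketch of that producer/consumer pattern; it is not the real Linda API (which also offers rd and eval), and a "pattern" here is simply a predicate over the tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical toy tuple space: producers post tuples with out();
// consumers block in in() until some tuple matches their pattern.
class ToyTupleSpace {
    private final List<Object[]> tuples = new ArrayList<>();

    public synchronized void out(Object... tuple) {   // post a tuple
        tuples.add(tuple);
        notifyAll();                                   // wake any waiting consumers
    }

    public synchronized Object[] in(Predicate<Object[]> pattern)
            throws InterruptedException {              // take one matching tuple
        while (true) {
            for (Object[] t : tuples) {
                if (pattern.test(t)) {
                    tuples.remove(t);
                    return t;
                }
            }
            wait();                                    // no match yet; block until out() is called
        }
    }
}
```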
Does Hadoop offer anything new? Yes:
The distributed, replicated file system goes beyond Linda.
Unlike with Java threads and UNIX processes, the programmer does not need to worry about synchronization.
Configuration is easy and management of resources is automatic.
It runs on ordinary computer hardware.
It is open source and therefore free.
What kind of applications are appropriate?
Hadoop is useful for some kinds of problems:
Data records must be largely independent of each other; the order in which they are processed must not matter.
The data must require too much space and/or too much processing time to be handled by one computer.
At some final stage, results from the parallel computations need to be related to each other.
Possible Hadoop Applications
Google's page-rank algorithm: masses of web links are reduced daily to ordered page ranks.
Large-scale searches; data mining. (The results of parallel searches can be related to each other.)
Typesetting a massive LaTeX document. (Most typesetting actions are local; the final pagination relates them.)
IRS: matching employer copies of W-2 forms to employee returns. (Data can be reordered, then collated.)
Cinematography: process the frames of a movie to create a ghost by making a character transparent in every frame. The frames would be processed in parallel and assembled in order at the last stage.
Sequential algorithms: problems that are inherently sequential benefit little from parallel computation. Example: simulating a sequential process.
Some problems involve massive unstructured data but do not need a final step that integrates or correlates the data. Example: processing the IRS refund checks. These can simply be done by a battery of separate machines.
Hadoop and Map-Reduce do not always fit the bill
Hadoop provides the wrong kind of parallelism here:
Edge detection, object detection, and smoothing operations on a digital image.
Solving systems of partial differential equations by simulation. Earthquake calculations involve processing 8-dimensional earth models. There is a huge amount of data, but each cell in the model affects computations at the next time step on all adjacent cells. The data is too structured for Hadoop. (Use a hypercube.)
Simulating the growth of a population in a non-uniform and limited environment. Hadoop does not fit the iterative nature of the problem, the increase in nodes at each generation, or the non-uniform processing of births, deaths, and migrations.
Commercial Involvement
Google released an early and partial description of map-reduce.
Google wants the open source community to develop a competing product.
Google wants students to learn about map-reduce and how to program in Hadoop.
IBM and Yahoo are supporting the development and teaching of Hadoop.
Hadoop provides a three-layer paradigm
Hadoop Essentials
The Replicated, Distributed File System
Phase 1: Mapping
Phase 2: Collection and Redistribution
Phase 3: Reducing
The File System
The Google file system is fundamental to the map-reduce paradigm; Hadoop is based on an implementation of the same idea.
A file is:
Distributed: segments of a file are kept on many different machines.
Replicated: each segment is stored three times, on nodes that are unlikely to fail simultaneously.
Managed: if a node fails, its segments are lost, but two copies of each remain. A third copy of each segment is promptly made on another node.
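As one small, concrete touch-point, the replication factor is visible through Hadoop's ordinary file-system API. A minimal sketch, assuming a reachable HDFS cluster and a hypothetical path /demo/example.txt:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file and ask HDFS to keep three replicas of each block.
public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");            // three copies of every block
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/demo/example.txt");      // hypothetical HDFS path
        FSDataOutputStream out = fs.create(p);
        out.writeBytes("hello, HDFS\n");
        out.close();
        // The name node reports how many replicas the file actually has.
        System.out.println("replication: " + fs.getFileStatus(p).getReplication());
    }
}
```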
Phase 1: Mapping
For each segment of data in the input file, one of the machines that stores it is assigned to run the mapping code.
The output of a mapper is a set of (key, record) pairs.
Users can optionally specify a Combiner to perform local aggregation of the intermediate outputs on the mapper nodes. This helps to cut down the amount of data transferred from the Mappers to the Reducers.
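A minimal word-count mapper against the org.apache.hadoop.mapreduce API, as a sketch of what mapping code looks like:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: for each line of its file segment, emit a (word, 1) pair.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);    // one (key, record) pair per word
        }
    }
}
```

In this word-count setting, the reducer shown under Phase 3 can also be registered as the Combiner, so counts are partially summed on the mapper nodes before the shuffle.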
Phase 2: Collection and Redistribution
This step involves Shuffle and Sort actions that occur simultaneously and probably run on the same nodes as Phase 3. The overall action is like a mergesort.
The framework collects and massages the output of the Mappers:
The key groups are partitioned among the Reducers. Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
Shuffle: for each Reducer, the framework fetches the relevant portion of the output of all the Mappers, via HTTP.
Sort: the framework groups Mapper outputs by key, combining like keys produced by different Mappers. The key groups then become input to the Reducers.
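A custom Partitioner is a small class. This hypothetical example routes keys by their first letter, so all words beginning with the same letter reach the same Reducer:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decide which Reducer (partition) receives each key.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;                                  // send empty keys to reducer 0
        }
        int letter = Character.toLowerCase(k.charAt(0));
        return letter % numPartitions;                 // same first letter -> same reducer
    }
}
```

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class); by default the framework hashes the key instead.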
Phase 3: Reducing
The reducing code can be written in Java, Python, etc.
A key group becomes the input to the reduce() function running on each Reducer. Each Reducer handles one key, or several if there are more keys than Reducers.
It reduces the set of intermediate values which share a key to a smaller set of values.
The smaller set, often with derived information added and data fields rearranged in a useful manner, becomes the output. This output might become the input to another map-reduce step.
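A matching word-count reducer, again a sketch against the org.apache.hadoop.mapreduce API: reduce() receives one key together with all of its intermediate values and writes the smaller result set.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: collapse all (word, 1) pairs for one word into a single (word, count) pair.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();                 // sum the values that share this key
        }
        total.set(sum);
        context.write(word, total);             // output: (word, total count)
    }
}
```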