CIS 4930/6930 Spring 2014 Introduction to Data Science / Data Intensive Computing. University of Florida, CISE Department. Prof. Daisy Zhe Wang




CIS 4930/6930 Spring 2014 Introduction to Data Science / Data Intensive Computing. University of Florida, CISE Department. Prof. Daisy Zhe Wang

Map/Reduce: Simplified Data Processing on Large Clusters. Parallel/Distributed Computing; Distributed File System; M/R Programming Model; Parallel Analytics using M/R. Adapted slides from Jeff Ullman, Anand Rajaraman and Jure Leskovec from Stanford.

Parallel Computing. MapReduce is designed for parallel computing. Before MapReduce: in the enterprise, a few super-computers, with parallelism achieved by parallel DBs (e.g., Teradata); in science and HPC, MPI and OpenMP over commodity computer clusters or over super-computers, which do not handle failures (i.e., no fault tolerance). Today: enterprise uses map-reduce and parallel DBs; science and HPC use OpenMP, MPI and map-reduce. MapReduce can apply over both multicore and distributed environments.

Data Analysis on a Single Server. Many data sets can be very large: tens to hundreds of terabytes. We cannot always perform data analysis on large datasets on a single server (why?): run time (i.e., CPU), memory, and disk space are all limits. [Diagram: a single server with several CPUs, memory, and disks]

Motivation: Google's Problem. 20+ billion web pages * 20KB = 400+ TB. One computer reads 30-35 MB/sec from disk, so it takes ~4 months to read the web, and ~1,000 hard drives to store the web. It takes even more to do something useful with the data! [Diagram: a single machine with CPU and memory]
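A quick back-of-the-envelope check of the slide's numbers (the 20 KB/page and 30-35 MB/s figures are the slide's assumptions; the 32.5 MB/s midpoint is ours):

```python
# Back-of-the-envelope: how long does one machine take to read the web?
pages = 20e9             # 20+ billion web pages
page_size = 20e3         # ~20 KB per page
corpus = pages * page_size          # total bytes, ~400 TB
read_rate = 32.5e6       # ~30-35 MB/s sustained disk read, midpoint

seconds = corpus / read_rate
months = seconds / (30 * 24 * 3600)
print(f"corpus = {corpus / 1e12:.0f} TB, read time = {months:.1f} months")
```

This lands within the "~4 months" the slide quotes, which is the whole motivation for spreading both storage and reads across many machines.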

Distributed Computing. Challenges: distributed/parallel programming is hard. How do we distribute data? Computation? Scheduling? Fault tolerance? Debugging? On the software side: what is the programming model? On the hardware side: how do we deal with hardware failures? MapReduce/Hadoop addresses all of the above: Google's computational/data manipulation model, and an elegant way to work with big data.

Commodity Clusters. The now-standard commodity architecture for such problems: a cluster of commodity Linux nodes with gigabit Ethernet interconnect. The challenge: how do we organize computations on this architecture and handle issues such as hardware failure?

Cluster Architecture. 1 Gbps between any pair of nodes in a rack; a 2-10 Gbps backbone between racks. Each rack contains 16-64 nodes. [Diagram: racks of nodes (CPU, memory, disk) connected by per-rack switches and a backbone switch]

Example Cluster Datacenters. In 2011, it was guesstimated that Google had 1M machines. Amazon Cloud? Probably millions of nodes today. AWB: Pivotal Analytics Work Bench, 1,000 nodes, software and virtualization provided. HiPerGator: UF's cluster datacenter, no API, no virtualization.

Problems with Cluster Datacenters. Hardware failures: machine failures (disk failure, CPU failure, data corruption). One server may stay up 3 years (~1,000 days), so if you have 1,000 servers, expect to lose one per day. With 1M machines, 1,000 machines fail every day! Network failure: copying data over a network takes time.
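The failure arithmetic above scales linearly with cluster size; a minimal sketch, assuming the slide's ~1,000-day mean server lifetime and a uniform failure rate:

```python
# Expected machine failures per day, given a mean server lifetime.
def failures_per_day(num_servers, mean_lifetime_days=1000):
    """Uniform-rate model: num_servers / lifetime failures per day."""
    return num_servers / mean_lifetime_days

print(failures_per_day(1_000))      # ~1 failure per day
print(failures_per_day(1_000_000))  # ~1,000 failures per day
```

At this scale, failure is the normal case, not the exception, which is why the stack must tolerate it automatically.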

Stable Storage. First-order problem: if nodes can fail, how can we store data persistently? Idea: store files multiple times for reliability. Answer: a distributed file system, which provides a global file namespace: Google's GFS, Hadoop's HDFS. Typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common.

Distributed File System. Chunk servers: a file is split into contiguous chunks, typically 16-64MB each; each chunk is replicated (usually 2x or 3x); try to keep replicas in different racks. Master node (a.k.a. Name Node in HDFS): stores metadata; likely to be replicated as well. Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data.
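A minimal sketch of the chunking-and-replication idea above. The chunk size is taken from the slide's 16-64MB range, but the round-robin placement policy and rack layout are illustrative, not what GFS or HDFS actually do:

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB, the upper end of the slide's range
REPLICATION = 3           # each chunk stored 3x

def split_into_chunks(file_size):
    """Number of fixed-size chunks a file occupies (last may be partial)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id, racks):
    """Place the replicas of one chunk on servers in distinct racks."""
    chosen = []
    for i in range(REPLICATION):
        rack = racks[(chunk_id + i) % len(racks)]   # rotate across racks
        chosen.append(rack[chunk_id % len(rack)])   # pick a server in it
    return chosen

# Toy cluster: 3 racks of 2 servers each.
racks = [["r0s0", "r0s1"], ["r1s0", "r1s1"], ["r2s0", "r2s1"]]
n = split_into_chunks(200 * 2**20)                  # a 200 MB file
placements = [place_replicas(c, racks) for c in range(n)]
```

With replicas in different racks, losing a whole rack (e.g., its switch) still leaves every chunk readable.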

Reliable Distributed File System. Data is kept in chunks spread across machines, with each chunk replicated on different machines, giving seamless recovery from disk or machine failure.

MapReduce Motivation. Large-scale data processing: we want to use 1000s of CPUs, but don't want the hassle of scheduling, synchronization, etc. MapReduce provides: automatic parallelization and distribution; fault tolerance; I/O scheduling and synchronization; monitoring and status updates.

MapReduce Basics. MapReduce is a programming model similar to Lisp (and other functional languages); many problems can be phrased using Map and Reduce functions. Benefits of implementing an algorithm with the MapReduce API: automatic parallelization, easy distribution across nodes, and nice retry/failure semantics.

Example Problem: Word Count. We have a huge text file, doc.txt, or a large corpus of documents, docs/*. Count the number of times each distinct word appears in the file. Sample application: analyze web server logs to find popular URLs.

Word Count (2). Case 1: the file is too large for memory, but all <word, count> pairs fit in memory. Case 2: the file is on disk and there are too many distinct words to fit in memory. Count occurrences of words with: words(doc.txt) | sort | uniq -c, where words takes a file and outputs the words in it, one word per line. This naturally captures the essence of MapReduce and is naturally parallelizable.
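The Case 2 pipeline counts by sorting rather than by keeping a hash table in memory: sort brings equal words together, then one streaming pass counts each run. The same essence in a short sketch (splitting on whitespace stands in for the hypothetical words utility):

```python
from itertools import groupby

def word_count_by_sorting(text):
    """Mimics `words doc.txt | sort | uniq -c`: sorting groups equal
    words together, so one sequential pass can count each run."""
    words = sorted(text.split())                  # words + sort
    return {w: sum(1 for _ in run)                # uniq -c
            for w, run in groupby(words)}

counts = word_count_by_sorting("to be or not to be")
print(counts)   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Sort-then-scan is exactly the trick MapReduce's group-by-key phase generalizes to many machines.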

3 Steps of MapReduce. Sequentially read a lot of data. Map: extract something you care about. Group by key: sort and shuffle. Reduce: aggregate, summarize, filter or transform. Then output the result. For different problems, the steps stay the same; only Map and Reduce change to fit the problem.

MapReduce: More Detail. Input: a set of key/value pairs. The programmer specifies two methods. Map(k, v) -> list(k1, v1): takes a key-value pair and outputs a set of key-value pairs (e.g., key is the file name, value is a single line in the file); there is one map call for every (k, v) pair. Reduce(k1, list(v1)) -> list(k1, v2): all values v1 with the same key k1 are reduced together to produce a single value v2; there is one reduce call per unique key k1. Group by key is executed by default in the same way: Sort(list(k1, v1)) -> list(k1, list(v1)).
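The signatures above can be sketched as a tiny single-process engine: map over every input pair, group by key with a sort (the shuffle), then reduce each group. This is illustrative only; real frameworks distribute these steps and add fault tolerance:

```python
from itertools import groupby

def map_reduce(inputs, map_fn, reduce_fn):
    # Map: one call per input (key, value) pair.
    intermediate = [kv for pair in inputs for kv in map_fn(*pair)]
    # Group by key: sort brings equal keys together (the shuffle phase).
    intermediate.sort(key=lambda kv: kv[0])
    grouped = [(k, [v for _, v in run])
               for k, run in groupby(intermediate, key=lambda kv: kv[0])]
    # Reduce: one call per unique intermediate key.
    return [reduce_fn(k, vs) for k, vs in grouped]

# Usage: word count, with document name as key and its text as value.
docs = [("d1", "a b a"), ("d2", "b")]
out = map_reduce(docs,
                 map_fn=lambda name, text: [(w, 1) for w in text.split()],
                 reduce_fn=lambda w, counts: (w, sum(counts)))
print(out)   # [('a', 2), ('b', 2)]
```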

MapReduce: The Map Step. [Diagram: input key-value pairs flow through parallel map calls, producing intermediate key-value pairs]

MapReduce: The Group and Reduce Step. [Diagram: intermediate key-value pairs are grouped by key into key-value groups, then parallel reduce calls produce the output key-value pairs]

Word Count using MapReduce.

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count in values:
    result += count
  emit(result)
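A direct, runnable translation of the pseudocode above, with a plain dictionary playing the role of the group-by-key step between map and reduce:

```python
from collections import defaultdict

def map_phase(key, value):
    # key: document name; value: text of document
    for w in value.split():
        yield (w, 1)

def reduce_phase(key, values):
    # key: a word; values: the counts emitted for that word
    return (key, sum(values))

groups = defaultdict(list)                       # group by key
for k, v in map_phase("doc.txt", "the quick the"):
    groups[k].append(v)

result = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(result)   # {'the': 2, 'quick': 1}
```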

MapReduce: Word Count. [Diagram: the word count example traced through the map, group-by-key, and reduce steps]