CLOUD COMPUTING INTRODUCTION TO DATA SCIENCE TIM KRASKA


CLICKER
How was the celebration of knowledge?
A) Very easy
B) Just OK
C) Tough
D) Spring break made me forget about it
E) I do not want to talk about it

BACKGROUND OF CLOUD COMPUTING
- 1980s and 1990s: 52% growth in performance per year!
- 2002: The thermal wall: speed (frequency) peaks, but transistors keep shrinking
- 2000s: Multicore revolution
  - 15-20 years later than predicted, we have hit the performance wall
- 2010s: Rise of Big Data

SOURCES DRIVING BIG DATA
- It's all happening on-line. Every: click, ad impression, billing event, fast forward, pause, friend request, transaction, network message, fault
- User generated (web & mobile), ...
- Internet of Things / M2M
- Scientific computing

DATA DELUGE
- Billions of users connected through the net: WWW, FB, Twitter, cell phones, ...
- 80% of the data on FB was produced last year
- Storage getting cheaper: store more data!

DATA GROWS FASTER THAN MOORE'S LAW
[Chart: projected growth over 2010, for 2010-2015, of data from particle accelerators and DNA sequencers, far outpacing Moore's Law]

SOLVING THE IMPEDANCE MISMATCH
- Computers not getting faster, and we are drowning in data
- How to resolve the dilemma?
- Solution adopted by web-scale companies: go massively distributed and parallel

ENTER THE WORLD OF DISTRIBUTED SYSTEMS
- Distributed systems/computing: a loosely coupled set of computers, communicating through message passing, solving a common goal
- Tools: message passing, distributed shared memory, RPC
- Distributed computing is challenging:
  - Dealing with partial failures (examples?)
  - Dealing with asynchrony (examples?)
  - Dealing with scale (examples?)
  - Dealing with consistency (examples?)
- Distributed computing versus parallel computing?
  - Distributed computing = parallel computing + partial failures

THE DATACENTER IS THE NEW COMPUTER
- The datacenter as a computer is still in its infancy
- Special-purpose clusters, e.g., Hadoop cluster
- Built from less reliable components
- Highly variable performance
- Complex concepts are hard to program (low-level primitives)

DATA CENTER
Image from http://wiki.apache.org/hadoop-data/attachments/hadooppresentations/attachments/aw-apachecon-eu-2009.pdf

DATACENTER/CLOUD COMPUTING OS
- If the datacenter/cloud is the new computer, what is its operating system?
- Note that we are not talking about a host OS
- Could be as beneficial as the LAMP stack was to the .com boom: every startup secretly implementing the same functionality!
- Open source stack for a Web 2.0 company:
  - Linux OS
  - Apache web server
  - MySQL, MariaDB, or MongoDB DBMS
  - PHP, Perl, or Python languages for dynamic web pages

CLASSICAL OPERATING SYSTEMS
- Data sharing: inter-process communication, RPC, files, pipes, ...
- Programming abstractions: libraries (libc), system calls, ...
- Multiplexing of resources: scheduling, virtual memory, file allocation/protection, ...

DATACENTER/CLOUD OPERATING SYSTEM
- Data sharing: Google File System, key/value stores; Apache project: Hadoop Distributed File System
- Programming abstractions: Google MapReduce; Apache projects: Hadoop, Pig, Hive, Spark
- Multiplexing of resources: Apache projects: Mesos, YARN (MapReduce v2), ZooKeeper, BookKeeper, ...

GOOGLE CLOUD INFRASTRUCTURE
- Google File System (GFS), 2003
  - Distributed file system for the entire cluster
  - Single namespace
- Google MapReduce (MR), 2004
  - Runs queries/jobs on data
  - Manages work distribution & fault-tolerance
  - Colocated with the file system
- Apache open source versions: Hadoop DFS and Hadoop MR

HADOOP DISTRIBUTED FILE SYSTEM
- Files split into 128MB blocks
- Blocks replicated across several datanodes (usually 3)
- Single namenode stores metadata (file names, block locations, etc.)
- Optimized for large files, sequential reads
- Files are append-only
- Data striped on hundreds/thousands of servers
  - Scan 100 TB on 1 node @ 50 MB/s = 24 days
  - Scan on 1000-node cluster = 35 minutes
[Diagram: a namenode mapping File1 to blocks 1-4, with each block replicated three ways across the datanodes]
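The scan numbers on the slide follow from simple arithmetic (a quick check; decimal units assumed, 1 TB = 10^6 MB):

```python
# Time to scan 100 TB at 50 MB/s, on one node vs. striped over 1000 nodes
data_mb = 100 * 10**6        # 100 TB in MB
rate_mb_s = 50               # sequential read rate per node

one_node_s = data_mb / rate_mb_s
print(one_node_s / 86400)        # ~23 days on a single node
print(one_node_s / 1000 / 60)    # ~33 minutes on 1000 nodes
```

The slide rounds these to 24 days and 35 minutes.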

CLICKER
The chance of a machine failing in 24h is 0.1%. What is the likelihood that at least one machine in a cluster of 1000 machines fails in 24h?
a) 0.1%
b) 10%
c) 63%
d) 99.999%
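Assuming machines fail independently, the answer follows from the complement rule (a sketch of the calculation):

```python
p = 0.001   # probability a single machine fails within 24h
n = 1000    # machines in the cluster

# P(at least one failure) = 1 - P(no machine fails at all)
p_any = 1 - (1 - p) ** n
print(round(100 * p_any))   # 63 (%), answer c
```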

GFS/HDFS INSIGHTS (2)
- Failures will be the norm
  - Mean time between failures for 1 node = 3 years
  - Mean time between failures for 1000 nodes = 1 day (3 years is roughly 1000 days, so a 1000-node cluster sees about one failure per day)
- Use commodity hardware
  - Failures are the norm anyway, buy cheaper hardware
- No complicated consistency models
  - Single writer, append-only data

WHAT IS MAPREDUCE?
- Simple data-parallel programming model designed for scalability and fault-tolerance
- Pioneered by Google
  - Processes 20 petabytes of data per day
- Popularized by the open-source Hadoop project
  - Used at Yahoo!, Facebook, Amazon, ...

WHAT IS MAPREDUCE USED FOR?
- At Google:
  - Index building for Google Search
  - Article clustering for Google News
  - Statistical machine translation
- At Yahoo!:
  - Index building for Yahoo! Search
  - Spam detection for Yahoo! Mail
- At Facebook:
  - Data mining
  - Ad optimization
  - Spam detection

TYPICAL HADOOP CLUSTER
- Aggregation switch above per-rack switches
- 40 nodes/rack, 1000-4000 nodes in cluster
- 1 Gbps bandwidth within rack, 8 Gbps out of rack
- Node specs (Yahoo terasort): 8 x 2GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
Image from http://wiki.apache.org/hadoop-data/attachments/hadooppresentations/attachments/yahoohadoopintro-apachecon-us-2008.pdf

CHALLENGES
- Cheap nodes fail, especially if you have many
  - Mean time between failures for 1 node = 3 years
  - Mean time between failures for 1000 nodes = 1 day
  - Solution: build fault-tolerance into the system
- Commodity network = low bandwidth
  - Solution: push computation to the data
- Programming distributed systems is hard
  - Solution: data-parallel programming model: users write map & reduce functions, the system distributes work and handles faults

MAPREDUCE PROGRAMMING MODEL
- Data type: key-value records
- Map function: (K_in, V_in) -> list(K_inter, V_inter)
- Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)
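Written as Python type hints, the two signatures look like this (an illustrative sketch, not Hadoop's actual API; `wc_map` is a hypothetical conforming mapper):

```python
from typing import Callable, Iterable, Tuple, TypeVar

Kin, Vin = TypeVar("Kin"), TypeVar("Vin")
Kinter, Vinter = TypeVar("Kinter"), TypeVar("Vinter")
Kout, Vout = TypeVar("Kout"), TypeVar("Vout")

# Map: one input record -> a list of intermediate key-value pairs
Mapper = Callable[[Kin, Vin], Iterable[Tuple[Kinter, Vinter]]]

# Reduce: an intermediate key plus all its values -> output pairs
Reducer = Callable[[Kinter, Iterable[Vinter]], Iterable[Tuple[Kout, Vout]]]

# A word-count mapper that fits the Mapper shape:
def wc_map(offset: int, line: str):
    return [(word, 1) for word in line.split()]
```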

EXAMPLE: WORD COUNT

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

WORD COUNT EXECUTION
[Diagram: three input lines ("the quick brown fox", "the fox ate the mouse", "how now brown cow") each go to a mapper, which emits (word, 1) pairs; shuffle & sort groups the pairs by word; two reducers emit the final counts: the 3, brown 2, fox 2, quick 1, how 1, now 1, ate 1, cow 1, mouse 1]
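The whole execution can be reproduced in-memory with plain Python (a sketch of the model, not of Hadoop itself; `mapper` and `reducer` follow the slide's pseudocode, with `output` replaced by `yield`):

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_mapreduce(inputs):
    # Map phase: apply the mapper to every input record
    pairs = [kv for line in inputs for kv in mapper(line)]
    # Shuffle & sort: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one reducer call per distinct key
    return dict(kv for key in sorted(groups)
                   for kv in reducer(key, groups[key]))

lines = ["the quick brown fox",
         "the fox ate the mouse",
         "how now brown cow"]
counts = run_mapreduce(lines)
print(counts["the"], counts["brown"], counts["fox"])  # 3 2 2
```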

MAPREDUCE EXECUTION DETAILS
- Single master controls job execution on multiple slaves
- Mappers preferentially placed on same node or same rack as their input block
  - Minimizes network usage
- Mappers save outputs to local disk before serving them to reducers
  - Allows recovery if a reducer crashes
  - Allows having more reducers than nodes

AN OPTIMIZATION: THE COMBINER
- A combiner is a local aggregation function for repeated keys produced by the same map
- Works for associative functions like sum, count, max
- Decreases size of intermediate data
Example: map-side aggregation for word count:

def combiner(key, values):
    output(key, sum(values))

WORD COUNT WITH COMBINER
[Diagram: the same three input lines, but each mapper now aggregates its own output locally, e.g. the second mapper emits (the, 2) instead of two (the, 1) pairs; shuffle & sort and the reducers then produce the same final counts]
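Counting intermediate records shows the saving (a small check on the slide's data, treating each line as one map task):

```python
from collections import Counter

lines = ["the quick brown fox",
         "the fox ate the mouse",
         "how now brown cow"]

# Without a combiner: one (word, 1) record per input word
without = sum(len(line.split()) for line in lines)

# With a combiner: one (word, count) record per distinct word per mapper
with_combiner = sum(len(Counter(line.split())) for line in lines)

print(without, with_combiner)   # 13 12: the duplicate "the" was combined
```

The saving grows with repetition: skewed keys (like stop words in real text) can shrink intermediate data dramatically.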

FAULT TOLERANCE IN MAPREDUCE
1. If a task crashes:
- Retry on another node
  - OK for a map because it has no dependencies
  - OK for a reduce because map outputs are on disk
- If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
Note: for these fault tolerance features to work, your map and reduce tasks must be side-effect-free

FAULT TOLERANCE IN MAPREDUCE
2. If a node crashes:
- Re-launch its current tasks on other nodes
- Re-run any maps the node previously ran
  - Necessary because their output files were lost along with the crashed node

FAULT TOLERANCE IN MAPREDUCE
3. If a task is going slowly (straggler):
- Launch a second copy of the task on another node ("speculative execution")
- Take the output of whichever copy finishes first, and kill the other
- Surprisingly important in large clusters
  - Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
  - A single straggler may noticeably slow down a job
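The straggler mechanism can be sketched by racing two copies of the same task and taking whichever finishes first (illustrative only, not Hadoop's scheduler; the delays are made-up stand-ins for real work):

```python
import concurrent.futures
import time

def attempt(name, delay):
    time.sleep(delay)   # simulate work; the straggler takes much longer
    return name

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    copies = [pool.submit(attempt, "straggler", 0.5),
              pool.submit(attempt, "speculative copy", 0.05)]
    done, _ = concurrent.futures.wait(
        copies, return_when=concurrent.futures.FIRST_COMPLETED)
    winner = done.pop().result()
    print(winner)   # speculative copy
```

Note that the `with` block still waits for both threads to finish; a real scheduler would kill the losing attempt instead.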

TAKEAWAYS
- By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
  - Automatic division of a job into tasks
  - Automatic placement of computation near data
  - Automatic load balancing
  - Recovery from failures & stragglers
- User focuses on the application, not on the complexities of distributed computing

MAPREDUCE PROS
- Distribution is completely transparent
  - Not a single line of distributed programming (ease, correctness)
- Automatic fault-tolerance
  - Determinism enables re-running failed tasks somewhere else
  - Saved intermediate data enables re-running only the failed reducers
- Automatic scaling
  - As operations are side-effect-free, they can be distributed to any number of machines dynamically
- Automatic load-balancing
  - Move tasks and speculatively execute duplicate copies of slow tasks (stragglers)