MapReduce and Intro to Cloud Computing. ID2210: Lecture 11 Jim Dowling


1 MapReduce and Intro to Cloud Computing ID2210: Lecture 11 Jim Dowling

2 Large-Scale Distributed Computing
In #Nodes
- BitTorrent (millions)
- Peer-to-Peer
In #Instructions/sec
- Teraflops, Petaflops, Exascale
- Super-Computing
In #Bytes stored
- Facebook: 30+ Petabytes (July 2011)*
- NoSQL storages
In #Bytes processed/time
- Google processes 24 petabytes of data per day
- ?
* moves_30_petabyte_hadoop_cluster_to_new_data_center

3 Big Data The total amount of digital storage worldwide is approaching 1 zettabyte, or 1 million times the contents of the Earth's largest library. Currently that information is archived on equipment with a mass equivalent to 20 percent of Manhattan. Global data storage is expected to reach 35 zettabytes by 2020. Source: The Boston Globe, Editorial & Opinion, September 7, 2010

4 Programming Large-Scale With thousands of servers available within a data centre, how do we: - write applications for them? - allocate and manage resources? Applications should also be scalable, reliable, and highly available. - Failures are expected with thousands of machines. - Need for load-balancing, handling heterogeneity.

5 Commodity Computing Challenges Cheap nodes fail, especially if you have many - Mean time between failures for 1 node = 3 years - Mean time between failures for 1000 nodes = 1 day - Solution: Build fault-tolerance into system Commodity networks have low(ish) bandwidth - Scan 100TB Datasets on 1000 node cluster with remote 10MB/s = 165 mins - Solution: Push computation to the data Programming distributed systems is hard - Solution: Provide a simple to use data-parallel programming model that distributes work and handles faults.
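The two figures on this slide can be checked with a little arithmetic. The sketch below assumes independent node failures (so the cluster-wide failure rate is simply the node count times the single-node rate), an even split of the data across nodes, and decimal units (1 TB = 10^6 MB). The function names are illustrative, not taken from the lecture.

```python
def cluster_mtbf_days(node_mtbf_years: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster, in days.

    Assumes failures are independent, so the rates simply add up.
    """
    return node_mtbf_years * 365 / nodes


def scan_minutes(dataset_tb: float, nodes: int, mb_per_sec: float) -> float:
    """Minutes for every node to read its equal share of the data set."""
    share_mb = dataset_tb * 1_000_000 / nodes  # decimal units: 1 TB = 10^6 MB
    return share_mb / mb_per_sec / 60


print(cluster_mtbf_days(3, 1000))   # roughly one failure per day
print(scan_minutes(100, 1000, 10))  # close to the slide's 165 minutes
```

Each node has to pull 100 GB over a 10 MB/s link, which is why the slide argues for pushing computation to the data instead of moving the data.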

6 Typical Large-Scale Programming Problem
- Iterate over a large number of records
- Extract something of interest from each (Map)
- Shuffle and sort intermediate results
- Aggregate intermediate results (Reduce)
- Generate final output
Key idea: provide a functional abstraction for these two operations (Map and Reduce)

7 Programming in the Large Imagine we have hundreds of thousands of documents structured as follows:
    {
      "type": "blog",
      "id": "564",
      "tags": ["hdfs", "mysql", "cats"],
      "content": "<div>...</div>",
      "mentions": [
        { "google": 6, "apple": 11, "microsoft": 1 }
      ]
    }

8 Extract all the mentions of Google

9 Count Mentions in each Blog, then Sum Up

10 Pseudocode for Map Phase
Extract the blog_id and the number of mentions of google from each document
    def mapper(docs):
        for blog in docs:
            # one (key, value) pair per blog document
            output(blog["id"], blog["mentions"][0]["google"])

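11 Pseudocode for Reduce Phases
Sum up all the google mentions from the same blog_id
    def reducer(blog_id, mentions_google):
        # mentions_google: list of counts emitted for this blog_id
        # agg_id: one constant key, so all per-blog totals meet in the next reduce
        output(agg_id, sum(mentions_google))
Sum up all the google mentions from all blogs
    def reducer(agg_id, mentions_google):
        # mentions_google: list of per-blog totals
        output("count", sum(mentions_google))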
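The map phase and the two reduce phases from slides 10 and 11 can be sketched as runnable Python. This is a single-process illustration only: the sample documents, the `shuffle` helper and the key names `"all_blogs"` and `"count"` are assumptions made for the example, not part of any MapReduce framework.

```python
from collections import defaultdict

# Hypothetical sample documents following the structure on slide 7.
docs = [
    {"type": "blog", "id": "564", "mentions": [{"google": 6, "apple": 11}]},
    {"type": "blog", "id": "564", "mentions": [{"google": 2}]},
    {"type": "blog", "id": "565", "mentions": [{"google": 1, "microsoft": 1}]},
]


def mapper(doc):
    """Emit (blog_id, google_mentions) pairs for one document."""
    for mention in doc["mentions"]:
        yield doc["id"], mention.get("google", 0)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_per_blog(blog_id, counts):
    """First reduce: total per blog, re-keyed under one aggregate key."""
    return "all_blogs", sum(counts)


def reduce_total(agg_id, counts):
    """Second reduce: grand total over all blogs."""
    return "count", sum(counts)


mapped = [pair for doc in docs for pair in mapper(doc)]
per_blog = shuffle(mapped)
stage1 = [reduce_per_blog(k, v) for k, v in per_blog.items()]
stage2 = [reduce_total(k, v) for k, v in shuffle(stage1).items()]

print(dict(per_blog))  # {'564': [6, 2], '565': [1]}
print(stage2)          # [('count', 9)]
```

Re-keying every per-blog total under one constant key is what lets the second reduce see all totals in a single group.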

12 Google's MapReduce Programming Model* *Slide taken from the tutorial by Jerry Zhao and Jelena Pjesivac-Grovic (Google Inc.): MapReduce: The Programming Model and Practice. Tutorial held at SIGMETRICS 2009.

13 MapReduce Programming Model
Input data type: key-value records
Map function: (K_in, V_in) -> list(K_inter, V_inter)
Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)

14 MapReduce Programming Model
map
- Takes input records, one by one: (key, value)
- Processes records independently
- Outputs 1..n intermediate (key, value) pairs per input record
reduce
- Takes intermediate results, grouped by key: (key, value[])
- Processes records group-wise
- Outputs a result per group, in any format
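The map and reduce signatures can be written down as Python type hints together with a tiny in-memory driver. This is a sketch under stated assumptions: `map_reduce`, `word_map` and `word_reduce` are hypothetical names, everything runs in one process, and keys are sorted only to make the output deterministic (so they must be comparable).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

KIn = TypeVar("KIn")
VIn = TypeVar("VIn")
KInter = TypeVar("KInter")
VInter = TypeVar("VInter")
Out = TypeVar("Out")


def map_reduce(
    records: Iterable[Tuple[KIn, VIn]],
    map_fn: Callable[[KIn, VIn], Iterable[Tuple[KInter, VInter]]],
    reduce_fn: Callable[[KInter, List[VInter]], Iterable[Out]],
) -> List[Out]:
    """Run map, group by intermediate key, then reduce each group."""
    groups: Dict[KInter, List[VInter]] = defaultdict(list)
    for key, value in records:              # records are mapped one by one
        for k, v in map_fn(key, value):     # each may emit several pairs
            groups[k].append(v)
    output: List[Out] = []
    for k in sorted(groups):                # reduce is applied per key group
        output.extend(reduce_fn(k, groups[k]))
    return output


def word_map(filename: str, line: str):
    for word in line.split():
        yield word.lower(), 1


def word_reduce(word: str, counts: List[int]):
    yield word, sum(counts)


lines = [("a.txt", "the quick brown fox"), ("b.txt", "The lazy dog and the fox")]
result = map_reduce(lines, word_map, word_reduce)
print(result)
```

Because `map_fn` looks at one record at a time and `reduce_fn` at one key group at a time, a real framework is free to run either of them on many machines in parallel.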

15 MapReduce Workflow [Figure: input files (partition1, partition2, partition3) are read by workers in the map phase, which write intermediate output; workers in the reduce phase read the intermediate output and write the output file(s).]

16 MapReduce Basics MapReduce is a programming model (and framework) that hides the complexity of work distribution and fault tolerance. Principal design philosophies: - Near-linear scalability for data sets - Low cost: reduce hardware, programming and admin costs. MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time.

17 MapReduce-Like Implementations
               Google MR     Hadoop                                      Dryad
Availability   Proprietary   Open Source                                 Proprietary
Used by        Google        Yahoo!, Facebook, Amazon (EC2!), Twitter    Microsoft
Native API     C++           Java                                        C++

18 Load Balancing, Failure, and Stragglers
Load balancing
- Break a MapReduce job into small tasks
- Schedule tasks on workers as they report idle status
MapReduce functions are side-effect free
- Enables failed (and partially completed) tasks to be re-executed without any problems (on a different machine)
- When a worker fails, its tasks can be reallocated to other workers
Identify and handle stragglers (slow workers)
- Restart slow tasks on new workers
- Stragglers appear with increasing probability as the number of workers increases
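The claim that stragglers become more likely as the number of workers grows can be illustrated with a simple probability model. The model is an assumption, not something stated in the lecture: each worker is slow independently with the same probability `p_slow`, and the value 0.01 is a hypothetical choice.

```python
def p_any_straggler(p_slow: float, workers: int) -> float:
    """Probability that at least one of `workers` independent workers is slow."""
    return 1 - (1 - p_slow) ** workers


for n in (1, 10, 100, 1000):
    print(n, round(p_any_straggler(0.01, n), 3))
```

Under this model a job with a thousand workers almost always has a straggler, which is why restarting slow tasks on new workers pays off at scale.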

19 Components in a Hadoop MR Workflow Next few slides are from:

20 Job Submission

21 Initialization

22 Scheduling

23 Execution

24 Pig Latin: a relational data-flow language MapReduce programs are quite low-level. A higher-level programming model was needed for processing semi-structured data sets using the MapReduce platform. Pig Latin is a procedural, relational data-flow language that is implemented using MapReduce. Pig Engine: parser, optimizer and distributed query execution.

25 Pig Example* Input: User profiles, Page visits. Problem: Find the top 5 most visited pages by users aged 18 to 25. Data flow: Load Users and Load Pages -> Filter by Age -> Join on name -> Group on url -> Count clicks -> Order by clicks -> Take top 5. *Example taken from Yahoo Hadoop tutorial from Middleware

26 In MapReduce

27 In PigLatin
    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';
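To make the data flow concrete, the same filter, join, group, count, order and limit steps can be sketched over in-memory Python lists. The sample `users` and `pages` records are made up for illustration, and the join is simplified to a set-membership test, which is only valid because user names are assumed to be unique.

```python
from collections import Counter

# Hypothetical sample data: (name, age) and (user, url).
users = [("amy", 22), ("bob", 31), ("cat", 19)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("bob", "b.com"),
         ("amy", "c.com"), ("cat", "c.com"), ("amy", "a.com")]

# Filter by age.
filtered = {name for name, age in users if 18 <= age <= 25}

# Join on name (simplified: keep page visits whose user passed the filter).
joined = [(user, url) for user, url in pages if user in filtered]

# Group on url and count clicks.
clicks = Counter(url for _, url in joined)

# Order by clicks (descending) and take the top 5.
top5 = clicks.most_common(5)
print(top5)  # [('a.com', 3), ('c.com', 2)]
```

Each Python statement corresponds to one Pig Latin relation; Pig compiles the equivalent steps into MapReduce jobs rather than running them in one process.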

28 Pig Latin Data Types Tuple: Ordered set of fields - Field can be simple or complex type - Nested relational model Bag: Collection of tuples. - Can contain duplicates Map: Set of (key, value) pairs Primitive types - int, long, float, double, chararray, bytearray

29 Pig Architecture

30 Hive and HBase Hive and Pig were parallel projects developed at Facebook and Yahoo, respectively. HiveQL is closer to SQL from traditional RDBMSs than Pig Latin (which is procedural). - Due to the limitations of MapReduce, and the fact that HiveQL is compiled into a MapReduce query plan, HiveQL is a cut-down version of SQL. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. A lot of Hive programs run over HBase.

31 MapReduce Filesystem Requirements
MapReduce jobs run on data stored in files.
Support large files
- Streaming reads
- Mostly append to end (easier concurrency)
Scalability
- Add machines to scale
Worker tasks use data on their local machine
- Bandwidth is the bottleneck. Move code to data.
Expect failures
- Transparently handle failures as much as possible.

32 Hadoop Filesystem (HDFS)
- Supports huge data sets and large files: gigabyte files, petabyte data sets
- Supports tens of millions of files in a file system
- Files have write-once-read-many semantics; clients can only append to existing files
- Designed to run on COTS hardware
- Implemented in Java
- Timely detection of, and recovery from, data node faults
- Batch processing rather than interactive user access

33 HDFS Clusters can be very large! HDFS at Facebook (May 2010):
- 21 PB of storage in a single HDFS cluster containing 2000 machines
- A mix of machines with 8 cores and machines with 16 cores
- 12 TB per machine (some have 24 TB)
- 32 GB of RAM per machine

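34 HDFS Architecture [Figure: a client sends metadata ops to the Namenode, which holds the metadata (name, replicas, ...), e.g. /home/foo/data,6... The client reads from and writes to Datanodes; the Namenode sends block ops to Datanodes, and blocks are replicated between Datanodes across Rack1 and Rack2.] Image from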

35 Typical Hadoop Cluster 40 nodes/rack, nodes in cluster. 1 Gbps within a rack; 8 Gbps between racks. [Figure: an aggregation switch connecting the rack switches.] Image from

36 The HDFS NameNode
A single NameNode manages the file system metadata and regulates access to files by clients.
- FileName->[BlockIds]
- BlockIds->[replica locations]
Controls replication of blocks to DataNodes
- Listens for heartbeats from DataNodes
- Signals creating, opening, closing of blocks
- Load balancing, rack-aware distribution
It is a single point of failure (as of May 2012). Ongoing work on a master-slave failover model.

37 HDFS DataNodes DataNodes store a set of blocks. Files are split into one or more blocks (64 MB default). The NameNode sends instructions to DataNodes for block creation, deletion, and replication. Clients read/write data in blocks at DataNodes.
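The block splitting described here, together with the two NameNode mappings from slide 36 (FileName->[BlockIds] and BlockIds->[replica locations]), can be sketched in a few lines of Python. This is an illustration only: the block ID format, the replication factor of 3, the DataNode names and the round-robin placement are assumptions for the example, and the placement in particular is not HDFS's rack-aware policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as on the slide


def num_blocks(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a non-empty file (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


def place_file(name, size, datanodes, replication=3):
    """Build the two metadata maps for one file.

    Round-robin placement is a stand-in for a real placement policy;
    it yields distinct replicas only if replication <= len(datanodes).
    """
    block_ids = [f"{name}#blk{i}" for i in range(num_blocks(size))]
    file_to_blocks = {name: block_ids}
    block_to_replicas = {
        blk: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        for i, blk in enumerate(block_ids)
    }
    return file_to_blocks, block_to_replicas


files, replicas = place_file("/home/foo/data", 200 * 1024 * 1024,
                             ["dn1", "dn2", "dn3", "dn4"])
print(len(files["/home/foo/data"]))        # 4 blocks for a 200 MB file
print(replicas["/home/foo/data#blk0"])     # ['dn1', 'dn2', 'dn3']
```

A client would first ask the NameNode for these mappings and then read or write the blocks directly at the listed DataNodes.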

38 HDFS Client Read and write HDFS data Create checksum files for files in HDFS. - Recompute the checksum when a file has been read. Caches file data to reduce load on NameNode
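The checksum step can be illustrated with CRC32 from Python's standard library. This is a minimal sketch of the idea only: the 512-byte chunk size and the function names are assumptions for the example, and the code does not reproduce the actual HDFS client or its on-disk checksum file format.

```python
import zlib
from typing import List

CHUNK = 512  # bytes covered by each checksum; an assumed value for this sketch


def checksums(data: bytes, chunk: int = CHUNK) -> List[int]:
    """Compute one CRC32 per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data: bytes, expected: List[int], chunk: int = CHUNK) -> bool:
    """Recompute the checksums after a read and compare with the stored ones."""
    return checksums(data, chunk) == expected


original = b"x" * 1500
stored = checksums(original)            # kept alongside the data at write time

corrupted = original[:700] + b"y" + original[701:]
print(verify(original, stored))         # True
print(verify(corrupted, stored))        # False
```

Checksumming per chunk rather than per file lets a reader detect which part of a block is damaged and fetch that block from another replica.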

39 Related Filesystems
- Google File System (GFS)
- Kosmos File System (KFS)
  - Open source; scales better than HDFS
  - C++ implementation
  - Master polls data nodes
  - Supports writing to multiple arbitrary positions in files and file appending

40 Introduction to Cloud Computing

41 Democratization of Large-Scale Computing Cloud computing is the delivery of hosting services that are provided to a client over the Internet. - Enable large-scale services without up-front investment. New programming tools, databases and systems have enabled the low-cost construction of large-scale services.

42 NIST Definition of Cloud Computing "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."

43 Supporting Technologies Enormous computer data-centres containing commodity hardware. Virtualization of computation, storage, and communication. - Turn hardware and networking into software! Achieve economies of scale. - Reduce costs of electricity, bandwidth, hardware, software and use low-cost locations. - Lower-cost than provisioning own hardware. NoSQL datastores have enabled storage scalability to much higher levels than relational databases.

44 Cloud Computing Essentials Cloud computing is Utility Computing - Cloud services are controlled and monitored by the cloud provider, typically through a pay-per-use business model. An ideal cloud computing platform is: - efficient in its use of resources - scalable and elastic - self-managing - highly available and accessible - inter-operable and portable

45 Cloud Properties Resource efficiency: computing and network resources are pooled to provide services to multiple users. Resource allocation is dynamically adapted according to user demand. Elasticity: computing resources can be rapidly and elastically provisioned to scale up, and released to scale down, based on the consumer's demand.

46 Cloud Properties Self-managing services: a consumer can provision cloud services, such as web applications, server time, processing, storage and network, as needed and automatically, without requiring human interaction with each service's provider. Accessible and highly available: cloud resources are available over the network anytime and anywhere and are accessed through standard mechanisms that promote use by different types of platform (e.g., mobile phones, laptops, and PDAs).

47 IaaS, PaaS and SaaS
- Infrastructure as a Service (IaaS): Infrastructure (servers, storage, network)
- Platform as a Service (PaaS): Platform (OS & application stack) on top of Infrastructure (servers, storage, network)
- Software as a Service (SaaS): Applications (packaged software) on top of Platform (OS & application stack) and Infrastructure (servers, storage, network)

48 Spectrum of Cloud Users Image credit:

49 Infrastructure as a Service (IaaS) Virtualization - Virtualization is the abstraction of logical resources away from underlying physical resources. A hypervisor virtualizes a platform's operating system. The hypervisor manages operating systems as virtual machines (VMs), enabling multiple operating systems to share the same physical hardware. [Figure: VM1, VM2 and VM3 sharing one physical machine.]

50 Virtualizing the Network and Storage

51 KVM (Kernel-based Virtual Machine) VMware and Xen are the best-known virtualization platforms. KVM (Kernel-based Virtual Machine) is an open-source virtualization platform. - Linux host OS: run multiple virtual machines (Windows, Mac, etc.) on your Linux box - I/O is virtualized using a device model in KVM. KVM requires a modified QEMU (open-source processor emulator) for its I/O virtualization framework.

52 Virtualization using KVM in Linux KVM is a loadable kernel module - kvm.ko provides the core virtualization infrastructure - kvm-intel.ko / kvm-amd.ko processor specific modules

53 IO Device Model in KVM Original approach with full virtualization - Guest hardware accesses are intercepted by KVM - QEMU emulates the hardware behavior of common devices: video cards, PCI, input devices (mouse, keyboard), NICs

54 IaaS is Not Enough IaaS provides virtual machines, but it cannot provide elastic computing by itself, where services scale up and down to meet user demand. - Dynamic provisioning. Existing IaaS offerings do not provide support for sharing middleware platforms among different VMs. - Multi-tenancy

55 From IaaS to PaaS
- Traditional stack (you manage everything): Applications, Data, Runtime, Middleware, OS, Servers, Storage, Networking
- IaaS (you manage the upper layers, the provider manages the lower layers): Applications, Data, Runtime, Middleware, OS, Virtualization, Servers, Storage, Networking
- PaaS (you manage the upper layers, the provider manages the lower layers): Applications, Data, Runtime, Middleware, OS, Virtualization, Servers, Storage, Networking

56 PaaS Platform as a Service (PaaS) is a computing platform that abstracts the infrastructure, OS, and middleware to drive developer productivity. - PaaS leverages dynamic provisioning - PaaS also leverages multi-tenancy

57 Under-Provisioning In traditional computing, underestimating system utilization results first in lost revenue, then lost customers. [Figure: three plots of resources against time (days) showing capacity and demand; the second is labelled "Lost Revenue" and the third "Lost Users".]

58 Over-Provisioning Overestimating system utilization results in higher than necessary infrastructure costs. [Figure: resources against time, with capacity above demand; the gap is labelled "Unused resources".] Dynamically provisioning resources solves the under-/over-provisioning problem.
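The trade-off between under- and over-provisioning can be made concrete with a small calculation. The demand series and capacity values below are made-up numbers in arbitrary resource units, and `provisioning_report` is a hypothetical helper, not part of any cloud API.

```python
def provisioning_report(demand, capacity):
    """Return (unmet demand, unused capacity) summed over all time steps."""
    unmet = sum(max(d - capacity, 0) for d in demand)    # under-provisioned
    unused = sum(max(capacity - d, 0) for d in demand)   # over-provisioned
    return unmet, unused


demand = [40, 60, 120, 90, 30]  # hypothetical demand per time step

print(provisioning_report(demand, 80))    # (50, 110): some demand is lost
print(provisioning_report(demand, 120))   # (0, 260): no loss, much idle capacity
```

With a fixed capacity, removing unmet demand increases idle resources. Capacity that follows demand at each time step would drive both numbers towards zero, which is the argument for dynamic provisioning.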

59 Dynamic Provisioning Cloud computing enables server computing instances to be provisioned or deployed from an administrative console or client application by the server administrator, network administrator, or any other enabled user. Self-managing systems perform dynamic provisioning on behalf of a user or administrator in order to ensure that quality of service (QoS) contracts are not broken and/or to meet some policy objectives.
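A self-managing system needs some rule for deciding how many instances to run. The sketch below shows one simple possibility, a target-utilization rule. The rule itself, the function name and all parameter values are assumptions for illustration and do not describe any particular cloud provider's autoscaler.

```python
import math


def desired_instances(load_per_sec: float,
                      capacity_per_instance: float,
                      target_utilization: float = 0.7,
                      min_instances: int = 1) -> int:
    """Instances needed so that each one runs at or below the target utilization."""
    usable = capacity_per_instance * target_utilization
    needed = math.ceil(load_per_sec / usable)
    return max(min_instances, needed)


# Hypothetical numbers: each instance handles 100 requests/s at full load.
for load in (0, 150, 1000):
    print(load, desired_instances(load, 100))
```

Keeping utilization below 100 percent leaves headroom for load spikes while new instances are still starting, and the minimum instance count keeps the service reachable when load drops to zero.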

60 How do we reuse middleware services? Imagine a single physical machine that is currently running 10 virtual machines (VMs), where each VM is running 5 active Java programs. Assuming no virtualized application server, how many JVM processes are running on the physical machine?

61 Multi-Tenancy Multi-tenancy is where a single instance of the software runs on a server, serving multiple clients. - Think of multiple users in a MySQL database - Multi-tenancy was proposed for Java 8 (many Java programs running in the same JVM). The software should be able to provide a single service to all customers by setting configurations. - More efficient use of server resources
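The question on slide 60 and the idea of a single configurable instance can both be sketched briefly. The count assumes that every Java program runs in its own JVM process. `MultiTenantService` is a hypothetical toy class that only illustrates "one instance, per-tenant configuration"; it is not a real middleware API.

```python
# Slide 60: without sharing, each Java program needs its own JVM process.
vms, programs_per_vm = 10, 5
jvms_without_sharing = vms * programs_per_vm
print(jvms_without_sharing)  # 50


class MultiTenantService:
    """One service instance; tenants differ only by configuration."""

    def __init__(self):
        self._config = {}

    def register(self, tenant, **settings):
        """Store per-tenant settings instead of starting a new instance."""
        self._config[tenant] = settings

    def greet(self, tenant):
        """Serve a request using the calling tenant's configuration."""
        cfg = self._config[tenant]
        return f"{cfg['greeting']}, {tenant}!"


service = MultiTenantService()
service.register("acme", greeting="Hello")
service.register("globex", greeting="Welcome back")
print(service.greet("acme"))
print(service.greet("globex"))
```

Both tenants are served by the same object in the same process, which is where the more efficient use of server resources comes from.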

62 PaaS

63 Software as a Service (SaaS)

64 Deployment Model There are four primary cloud deployment models : - Public Cloud - Private Cloud - Community Cloud - Hybrid Cloud

65 Public Clouds Public clouds are owned by cloud service providers who charge for the use of cloud resources.
Basic characteristics:
- Homogeneous infrastructure, common policies
- Shared resources and multi-tenancy
- Leased or rented infrastructure
- Economies of scale
Examples:
- EC2 (Amazon): Elastic Compute Cloud. General purpose computing.
- Azure (Microsoft): General purpose computing on a Microsoft platform.
- AppEngine (Google): Build scalable web applications fast. Not general purpose.

66 Amazon, Microsoft, Google Cloud Offerings
Computation model (VM)
- Amazon Web Services: x86 ISA via Xen VM
- Microsoft Azure: Microsoft CLR VM
- Google AppEngine: Predefined application structure and framework
Storage model
- Amazon Web Services: SimpleDB, S3
- Microsoft Azure: SQL Data Services
- Google AppEngine: MegaStore/BigTable
Networking model
- Amazon Web Services: Declarative specification of IP level topology
- Microsoft Azure: Automatic, based on programmer's declarative descriptions of app components
- Google AppEngine: Fixed topology to accommodate 3-tier Web app structure

67 Private Clouds The cloud infrastructure belongs to and is operated by only one organization. Basic characteristics : - Heterogeneous infrastructure; Customized policies - Dedicated resources - In-house infrastructure; End-to-end control Examples include:

68 Public vs. Private
                 Public Cloud              Private Cloud
Infrastructure   Homogeneous               Heterogeneous
Policy Model     Common defined            Customized & Tailored
Resource Model   Shared & Multi-tenant     Dedicated
Cost Model       Operational expenditure   Capital expenditure
Economy Model    Large economy of scale    End-to-end control

69 Other types of Clouds Community cloud - The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). Hybrid cloud - The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability.

70 Obstacles To Cloud Computing
- Data lock-in: no standardized APIs (OpenStack API)
- Data confidentiality/auditability
- Data transfer bottlenecks/costs
- Performance unpredictability for systems apps

71 Cloud Computing Summary

72 Challenges for Cloud computing Will all data migrate to the cloud? - The post-PC era. By 2015 some 6.3 exabytes of mobile data will be flowing each month*. How much will end up in the cloud? What will we do with all the new data generated by the Internet of Things and DNA sequencing machines? How will we manage security, ownership and migration of data stored in the cloud? *

73 References
- NIST (National Institute of Standards and Technology).
- Dean et al., MapReduce: simplified data processing on large clusters, Comms of ACM, vol 51(1)
- Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing

74 References
- Cloud Computing: Principles and Paradigms, R. Buyya et al. (eds.), Wiley
- Cloud Computing: Principles, Systems and Applications, L. Gillam et al. (eds.), Springer
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004
- Sanjay Ghemawat et al.: The Google File System. SIGOPS Operating Systems Review 37(5), 2003
- M. Isard et al.: Dryad: Distributed Data-parallel Programs from Sequential Building Blocks, in EuroSys 2007