Scheduling in the Cloud

Similar documents

Expanding the Horizons of Cloud Computing Beyond the Data Center. Jon Weissman Department of CS&E University of Minnesota

Using Proxies to Accelerate Cloud Applications

Living on the Edge: Scheduling Edge Resources Across the Cloud

Hadoop Parallel Data Processing

Map Reduce / Hadoop / HDFS

Apache Hadoop. Alexandru Costan

Exploring MapReduce Efficiency with Highly-Distributed Data

Energy Efficient MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce

Hadoop Architecture. Part 1

Distribution transparency. Degree of transparency. Openness of distributed systems

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Lecture 24: WSC, Datacenters. Topics: network-on-chip wrap-up, warehouse-scale computing and datacenters (Sections )

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Introduction to Cloud Computing

How To Use Hadoop

CSci 8980 Mobile Cloud Computing. Outsourcing: Components

Hadoop & its Usage at Facebook

Enabling the Use of Data

Big Data With Hadoop

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

A very short Intro to Hadoop

Hadoop & its Usage at Facebook

Jeffrey D. Ullman slides. MapReduce for data intensive computing

CSE-E5430 Scalable Cloud Computing Lecture 2

Fault Tolerance in Hadoop for Work Migration

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

BlobSeer: Towards efficient data storage management on large-scale, distributed systems

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Storage Architectures for Big Data in the Cloud

A programming model in Cloud: MapReduce

MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT

Data-Intensive Computing with Map-Reduce and Hadoop

GraySort on Apache Spark by Databricks

MapReduce Job Processing

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers

Matsu Workflow: Web Tiles and Analytics over MapReduce for Multispectral and Hyperspectral Images

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

San Diego Supercomputer Center, UCSD. Institute for Digital Research and Education, UCLA

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Cloud Computing using MapReduce, Hadoop, Spark

Hadoop Cluster Applications

Advanced Computer Networks. Scheduling

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

MapReduce and Hadoop Distributed File System

Chapter 17: Distributed Systems

Tutorial for Assignment 2.0

Research Article Hadoop-Based Distributed Sensor Node Management System

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application

Tutorial for Assignment 2.0

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Google File System. Web and scalability

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Computing. Benson Muite. benson.

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Cloud Computing. and Scheduling. Data-Intensive Computing. Frederic Magoules, Jie Pan, and Fei Teng SILKQH. CRC Press. Taylor & Francis Group

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Cloud Computing Where ISR Data Will Go for Exploitation

Using distributed technologies to analyze Big Data

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

HPC technology and future architecture

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Big Data Processing with Google s MapReduce. Alexandru Costan

Cloud Federation to Elastically Increase MapReduce Processing Resources

Parallel Processing of cluster by Map Reduce

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

Cloud Computing For Bioinformatics

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony

ISSN: Keywords: HDFS, Replication, Map-Reduce I Introduction:

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Big Fast Data Hadoop acceleration with Flash. June 2013

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

MapReduce and Hadoop Distributed File System V I J A Y R A O

Data Center Specific Thermal and Energy Saving Techniques

Hadoop Development & BI- 0 to 100

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Challenges for Data Driven Systems

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

A Service for Data-Intensive Computations on Virtual Clusters

Web DNS Peer-to-peer systems (file sharing, CDNs, cycle sharing)

bigdata Managing Scale in Ontological Systems

Data Management in the Cloud

Duke University

Virtual Machine Based Resource Allocation For Cloud Computing Environment

Introduction to DISC and Hadoop

Transcription:

Scheduling in the Cloud Jon Weissman Distributed Computing Systems Group Department of CS&E University of Minnesota

Introduction Cloud Context fertile platform for scheduling research re-think old problems in new context Two scheduling problems mobile applications across the cloud multi-domain MapReduce

The Standard Cloud Data in Computation Results out No limits Storage Computing

Multiple Data Centers Virtual Containers

Cloud Evolution => Scheduling Client technology devices: smart phones, ipods, tablets, sensors Big data 4th paradigm for scientific inquiry Multiple DCs/clouds global services Science clouds explicit support for scientific applications Economics power and cooling green clouds

Our Focus Power at the edge local clouds, ad-hoc clouds Nebula Cloud-2-Cloud multiple clouds Big data locality, in-situ Mobile user user-centric cloud Proxy DMapReduce Mobile cloud

Mobility Trend: Mobile Cloud Mobile users/applications: phones, tablets resource limited: power, CPU, memory applications are becoming sophisticated Improve mobile user experience performance, reliability, fidelity tap into the cloud based on current resource state, preferences, interests => user-centric cloud processing

Cloud Mobile Opportunity Dynamic outsourcing move computation, data to the cloud dynamically User context exploit user behavior to pre-fetch, pre-compute, cache

Application Partitioning Outsourcing model local data capture + cloud processing images/video, speech, digital design, aug. reality Server Server Server Server. Server cloud end Code repository Proxy Outsourcing Client mobile end. Application Profiler Outsourcing Controller

Application Model: Coarse- Grain Dataflow for i=0 to NumImagePairs a = ImEnhance.sharpen (seta[i],...); b = ImAdjust.autotrim (setb[i],...); c = ImSizing.distill (a, resolution); d = ImChange.crop (b, dimensions); e = ImJoin.stitch (c, d,...); URL.upload (www.flickr.com,..., e); end-for

Scheduling Setup Components i, j, Aij - amt of data flow between components i and j Platforms α, β, γ,... (mobile, cloud, server, ) Dα,i.type execute time, power consumed for i running on α Linkαβ,k.type transmit time, power consumed for kth link between αβ All assumed to be w/r Input I

2 Experimental Results -Image Response time both WIFI & 3G up to 27 speedup 219K, WIFI Power consumption save up to 9 times 219K, WIFI Sharpening Avg. Time Avg. Power

3 Experimental Results-Face Detection Avg. Time Face Detection identify faces in an image Tradeoffs power, response User specifies tradeoffs Avg. Power

Big Data Trend: MapReduce Large-Scale Data Processing Want to use 1000s of CPUs on TBs of data MapReduce provides Automatic parallelization & distribution Fault tolerance User supplies two functions: map reduce

Inside MapReduce MapReduce cluster set of nodes N that run MapReduce job specify number of mappers, reducers, <= N master-worker paradigm Data set is first injected into DFS Data set is chunked (64 MB), replicated three times to the local disks of machines Master scheduler tries to run map jobs and reduce jobs on workers near the data

MapReduce Workflow DFS push shuffle

Big Data Trend: Distribution Big data is distributed earth science: weather data, seismic data life science: GenBank, NCI BLAST, PubMed health science: GoogleEarth + CDC pandemic data web 2.0: user multimedia blogs DFS push

Context: Widely distributed data Data in different data-centers Run MapReduce across them Data-flow spanning wide-area networks

Data Scheduling: Wide-Area MapReduce Local MapReduce (LMR) Global MapReduce (GMR) Distributed MapReduce (DMR)

DMR is a great idea if output << input LMR and GMR are better in other settings PlanetLab Amazon EC-2

Intelligent Data Placement HDFS local cluster, nearby rack, random rack Resource Topology Application Characteristics static or observed /DCi/rackA/nodeX????? Data placement Scheduling LMR, DMR, GMR

Problem: Data Scheduling Data movement is dominant Data sets located in domains, size: Di, Dm Platform domains: Pj, Pk Inter-platform bandwidth: BDiPj Data expansion factors input->intermediate, α Intermediate->output, β => select LMR, DMR, GMR

Summary Cloud Evolution mobile users, big data, multiple clouds/data centers many scheduling challenges Cloud Opportunities new context for old problems application partitioning (mobile/cloud) data scheduling (wide-area MapReduce)