Cof low. Mending the Application-Network Gap in Big Data Analytics. Mosharaf Chowdhury. UC Berkeley

Size: px
Start display at page:

Download "Cof low. Mending the Application-Network Gap in Big Data Analytics. Mosharaf Chowdhury. UC Berkeley"

Transcription

1 Cof low Mending the Application-Network Gap in Big Data Analytics Mosharaf Chowdhury UC Berkeley

2 Big Data The volume of data businesses want to make sense of is increasing Increasing variety of sources Web, mobile, wearables, vehicles, scientific, Cheaper disks, SSDs, and memory Stalling processor speeds

3 Big Datacenters for Massive Parallelism BlinkDB Storm Pregel DryadLINQ MapReduce 2005 Hadoop Dryad GraphLab Spark Spark-Streaming GraphX Dremel Hive

4 Data-Parallel Applications Multi-stage dataflow Computation interleaved with communication Computation Stage (e.g., Map, Reduce) Distributed across many machines Tasks run in parallel Communication Stage (e.g., Shuffle) Between successive computation stages Reduce Stage A communication stage cannot complete until all the data have been transferred Map Stage

5 Communication is Crucial Performance Facebook jobs spend ~25% of runtime on average in intermediate comm. 1 As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck 1. Based on a month-long trace with 20,000 jobs and 150 Million tasks, collected from a 000-machine Facebook production MapReduce cluster.

6 Flow Transfers data from a source to a destination Independent unit of allocation, sharing, load balancing, and/or prioritization Faster Communication Stages: Networking Approach Configuration should be handled at the system level

7 Existing Solutions WFQ CSFQ D DeTail PDQ pfabric GPS RED ECN XCP RCP DCTCP D 2 TCP FCP 1980s 1990s 2000s Per-Flow Fairness Flow Completion Time Independent flows cannot capture the collective communication behavior common in data-parallel applications

8 Why Do They Fall Short? r 1 r 2 s 1 s 2 s

9 Why Do They Fall Short? r 1 r 2 s r 1 s 1 s 2 s s 2 2 s 1 s 2 s Datacenter 2 r 2 s Network

10 Why Do They Fall Short? s r 1 s r 2 s Datacenter Network Link to r 1 Link to r 2 Per-Flow Fair Sharing time 5 5 Shuffle Completion Time = 5 Avg. Flow Completion Time =.66 Solutions focusing on flow completion time cannot further decrease the shuffle completion time

11 Improve Application-Level Performance 1 s 1 s 2 s 1 2 Datacenter Network 1 2 r 1 r 2 Slow down faster flows to accelerate slower flows Link to r 1 Per-Flow Fair Sharing 5 Shuffle Completion Time = 5 Link to r 1 Data-Proportional Per-Flow Fair Sharing Allocation Shuffle Completion Time = 4 Link to r time 5 Avg. Flow Completion Time =.66 Link to r time Avg. Flow Completion Time = 4 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011.

12 Faster Communication Stages: Systems Approach Applications know their performance goals, but they have no means to let the network know Configuration should be handled by the end users

13 Faster Communication Stages: Systems Approach Faster Communication Stages: Networking Approach Configuration should be handled by the end users Configuration should be handled at the system level

14 ME MIND THE GAP Holistic Approach Applications and the Network Working Together

15 Cof low Communication abstraction for data-parallel applications to express their performance goals 1. Minimize completion times, 2. Meet deadlines, or. Perform fair allocation.

16 Broadcast Aggregation All-to-All Single Flow Shuffle Parallel Flows

17 1 1 for faster #1 completion of coflows? How to schedule coflows online to meet #2 more deadlines? N. Datacenter. N for fair # allocation of the network?

18 Varys 1 Enables coflows in data-intensive clusters 1. Coflow Scheduler 2. Global Coordination. The Coflow API Faster, application-aware data transfers throughout the network Consistent calculation and enforcement of scheduler decisions Decouples network optimizations from applications, relieving developers and end users 1. Efficient Coflow Scheduling with Varys, SIGCOMM 2014.

19 Cof low Communication abstraction for data-parallel applications to express their performance goals 1. The size of each flow, 2. The total number of flows, and. The endpoints of individual flows.

20 Benefits of Inter-Coflow Scheduling Coflow 1 Coflow 2 Link 2 Link 1 Units 6 Units 2 Units Fair Sharing Smallest-Flow First 1,2 The Optimal L2 L1 L2 L1 L2 L time Coflow1 comp. time = 5 Coflow2 comp. time = time Coflow1 comp. time = 5 Coflow2 comp. time = time Coflow1 comp. time = Coflow2 comp. time = 6 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 201.

21 Inter-Coflow Scheduling is NP-Hard Coflow 1 Coflow 2 Link 2 Link 1 Units 6 Units 2 Units L2 L1 Fair Sharing Concurrent Open Shop Scheduling 1 Examples include job scheduling and L1 caching blocks time Solutions use a ordering heuristic Coflow1 comp. time = 6 Coflow2 comp. time = 6 L2 Flow-level Prioritization time Coflow1 comp. time = 6 Coflow2 comp. time = 6 L2 L1 The Optimal time Coflow1 comp. time = Coflow2 comp. time = 6 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM A pfabric: Note on Minimal the Complexity Near-Optimal of the Datacenter Concurrent Transport, Open Shop SIGCOMM 201. Problem, Journal of Scheduling, 9(4):89 96, 2006

22 Inter-Coflow Scheduling is NP-Hard Coflow 1 Coflow 2 Link 2 Link 1 Units 6 Units 2 Units Concurrent Open Shop Scheduling with Coupled Resources Examples include job scheduling and caching blocks Solutions use a ordering heuristic Consider matching constraints 6 2 Input Links Output Links 1 1 Datacenter

23 Varys Employs a two-step algorithm to minimize coflow completion times 1. Ordering heuristic 2. Allocation algorithm Keep an ordered list of coflows to be scheduled, preempting if needed Allocates minimum required resources to each coflow to finish in minimum time

24 Ordering Heuristic C 1 ends C 2 ends C ends O 1 O O 5 Datacenter C 1 C 2 C Length 5 6 Time 1 19 Shortest-First (Total CCT = 5)

25 Ordering Heuristic C 1 ends C 2 ends C ends C ends C 2 ends C 1 ends O 1 O 1 O 2 O O O 5 Datacenter C 1 C 2 C Width 2 1 Time 1 19 Shortest-First (5) 6 Time Narrowest-First (Total CCT = 41)

26 Ordering Heuristic C 1 ends C 2 ends C ends C ends C 2 ends C 1 ends O 1 O 1 O 2 O O O 5 Datacenter Time 1 19 Shortest-First (5) 6 Time Narrowest-First (41) C 1 C 2 C Size O 1 O 2 C ends C 1 ends C 2 ends O 6 9 Time Smallest-First 19 (4)

27 Ordering Heuristic C 1 ends C 2 ends C ends C ends C 2 ends C 1 ends O 1 O 1 O 2 O O O 5 Datacenter Time 1 19 Shortest-First (5) 6 Time Narrowest-First (41) C 1 C 2 C Bottleneck 10 6 O 1 O 2 C ends C 1 ends C 2 ends C 1 ends O 1 O 2 C ends C 2 ends O O 6 9 Time Smallest-First 19 9 Time 19 Smallest-Bottleneck (4) (1)

28 Allocation Algorithm A coflow cannot finish before its very last flow Finishing flows faster than the bottleneck cannot decrease a coflow s completion time Allocate minimum flow rates such that all flows of a coflow finish together on time

29 Varys Enables coflows in data-intensive clusters 1. Coflow Scheduler 2. Global Coordination. The Coflow API Faster, application-aware data transfers throughout the network Consistent calculation and enforcement of scheduler decisions Decouples network optimizations from applications, relieving developers and end users

30 The Need for Coordination C 1 ends C 2 ends O O 2 O Time C 1 C 2 Bottleneck 4 5 Scheduling with Coordination (Total CCT = 1)

31 The Need for Coordination C 1 ends C 2 ends C 1 ends C 2 ends O 1 O O 2 O O 2 O Time Scheduling with Coordination 7 Time 12 Scheduling without Coordination (Total CCT = 1) (Total CCT = 19) Uncoordinated local decisions interleave coflows, hurting performance

32 Varys Architecture Centralized master-slave architecture Applications use a client library to communicate with the master Actual timing and rates are determined by the coflow scheduler Sender Receiver Driver Put Get Reg Varys Daemon Topology Monitor Usage Estimator Coflow Scheduler Varys Master Varys Daemon TaskName f Network Interface (Distributed) File System Comp. Tasks calling Varys Client Library Varys Daemon 1. Download from

33 Varys Enables coflows in data-intensive clusters 1. Coflow Scheduler 2. Global Coordination. The Coflow API Faster, application-aware data transfers throughout the network Consistent calculation and enforcement of scheduler decisions Decouples network optimizations from applications, relieving developers and end users

34 The Coflow API 1. NO changes to user jobs 2. NO storage management register b register(broadcast) s register(shuffle) put shuffle id b.put(content) get unregister broadcast mappers driver (JobTracker) b.unregister() b.get(id) id s1 s.get(id sl )

35 Evaluation A 000-machine trace-driven simulation matched against a 100-machine EC2 deployment 1. Does it improve performance? YES 2. Can it beat non-preemptive solutions?. Do we really need coordination?

36 Better than Per-Flow Fairness Comm. Heavy Comm. Improv. Job Improv. EC2.16X 1.85X 2.50X 1.25X Sim. 4.86X.21X.9X 1.11X

37 Preemption is Necessary [Sim.] Overhead Over Varys NO What About Starvation 0 1 2, 4 Varys Per-Flow Fair FIFO Per-Flow Priority FIFO-LM Varys NC Fairness Prioritization 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011

38 Lack of Coordination Hurts [Sim.] Overhead Over Varys , 4 Varys Per-Flow Fair FIFO Per-Flow Priority FIFO-LM Varys NC Fairness Prioritization Smallest-flow-first (per-flow priorities) Minimizes flow completion time FIFO-LM 4 performs decentralized coflow scheduling Suffers due to local decisions Works well for small, similar coflows 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM 2014

39 Cof low Communication abstraction for data-parallel applications to express their performance goals 1. The size of each flow, 2. The total number of flows, and. The endpoints of individual flows. ç Pipelining between stages ç Speculative executions ç Task failures and restarts

40 How to Perform Coflow Scheduling Without Complete Knowledge?

41 Implications Minimize Avg. Comp. Time Flows in a Single Link Coflows in an Entire Datacenter With complete knowledge Smallest-Flow-First Ordering by Bottleneck Size + Data-Proportional Rate Allocation Without complete knowledge Least-Attained Service (LAS)?

42 Revisiting Ordering Heuristics C 1 ends C 2 ends C ends C ends C 2 ends C 1 ends O 1 O 1 O 2 O O O 5 Time 1 19 Shortest-First 6 Time Narrowest-First (5) (41) C ends C 1 ends C 2 ends C 1 ends C ends C 2 ends O 1 O 2 O O 1 O 2 O 6 9 Time Smallest-First 19 9 Time 19 Smallest-Bottleneck (4) (1)

43 Coflow-Aware LAS (CLAS) Set priority that decreases with how much a coflow has already sent The more a coflow has sent, the lower its priority Smaller coflows finish faster Use total size of coflows to set priorities Avoids the drawbacks of full decentralization

44 Coflow-Aware LAS (CLAS) Continuous priority reduces to fair sharing when similar coflows coexist Priority oscillation Coflow 1 Coflow time Coflow1 comp. time = 6 Coflow2 comp. time = 6 FIFO works well for similar coflows Avoids the drawbacks of full decentralization time Coflow1 comp. time = Coflow2 comp. time = 6

45 Discretized Coflow-Aware LAS (D-CLAS) Priority discretization Change priority when total size exceeds predefined thresholds Scheduling policies FIFO within the same queue Prioritization across queue Weighted sharing across queues Guarantees starvation avoidance FIFO FIFO FIFO Q K Q 2 Q 1 Lowest- Priority Queue Highest- Priority Queue

46 How to Discretize Priorities? Exponentially spaced thresholds K : number of queues A : threshold constant E : threshold exponent FIFO Q K E K-1 A Lowest- Priority Queue Loose coordination suffices to calculate global coflow sizes FIFO Q 2 Slaves make independent decisions in between E 2 A E 1 A Small coflows (smaller than E 1 A) do not experience coordination overheads! E 1 A FIFO 0 Q 1 Highest- Priority Queue

47 Closely Approximates Varys [Sim. & EC2] Overhead Over Varys , Varys Per-Flow Fair FIFO Per-Flow Priority FIFO-LM Varys NC Fairness Prioritization w/o Complete Knowledge 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM 2014

48 My Contributions Orchestra SIGCOMM 11 Merged with Spark Varys Coflow SIGCOMM 14 Open-Source Aalo SIGCOMM 15 Open-Source Application-Aware Network Scheduling

49 My Contributions Orchestra SIGCOMM 11 Varys SIGCOMM 14 Aalo SIGCOMM 15 Application-Aware Network Scheduling

50 My Contributions Spark NSDI 12 Sinbad SIGCOMM 1 Network-Aware Applications Orchestra SIGCOMM 11 Varys SIGCOMM 14 Aalo SIGCOMM 15 Application-Aware Network Scheduling FairCloud SIGCOMM 12 HARP SIGCOMM 12 ViNEYard ToN 12 Datacenter Resource Allocation

51 My Contributions Spark NSDI 12 Top-Level Apache Project Orchestra SIGCOMM 11 Merged with Spark FairCloud Varys SIGCOMM 14 Open-Source HARP Sinbad SIGCOMM 1 Merged at Facebook Aalo SIGCOMM 15 Open-Source ViNEYard SIGCOMM 12 SIGCOMM 12 Bing Open-Source Network-Aware Applications Application-Aware Network Scheduling Datacenter Resource Allocation

52 Communication-First Big Data Systems In-Datacenter Analytics Cheaper SSDs and DRAM, proliferation of optical networks, and resource disaggregation will make network the primary bottleneck Inter-Datacenter Analytics Bandwidth-constrained wide-area networks End User Delivery Faster and responsive delivery of analytics results over the Internet for better end user experience

53 Systems ME MIND THE GAP Networking Better capture application-level performance goals using coflows Coflows improve application-level performance and usability Extends networking and scheduling literature Coordination even if not free is worth paying for in many cases

54

55 Improve Flow Completion Times s r 1 s r 2 s Datacenter Link to r 1 Per-Flow Fair Sharing Shuffle Completion Time = 5 Link to r 1 1 Smallest-Flow First 1,2 2 4 Shuffle Completion Time = 6 Link to r time Avg. Flow Completion Time =.66 Link to r time Avg. Flow Completion Time = Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 201.

56 Distributions of Coflow Characteristics Frac. of Coflows E+06 1.E+08 1.E+10 1.E+12 1.E+14 Coflow Length (Bytes) Frac. of Coflows E+00 1.E+04 1.E+08 Coflow Width (Number of Flows) Frac. of Coflows E+06 1.E+08 1.E+10 1.E+12 1.E+14 Coflow Size (Bytes) Frac. of Coflows E+06 1.E+08 1.E+10 1.E+12 1.E+14 Coflow Bottleneck Size (Bytes)

57 Traffic Sources 1. Ingest and replicate new data 2. Read input from remote machines, when needed. Transfer intermediate data 4. Write and replicate output Percentage of Traffic by Category at Facebook

58 Distribution of Shuffle Durations Performance Facebook jobs spend ~25% of runtime on average in intermediate comm. 1 Fraction of Jobs Fraction of Runtime Spent in Shuffle Month-long trace from a 000- machine MapReduce production cluster at Facebook 20,000 jobs 150 Million tasks

59 Theoretical Results Structure of optimal schedules Permutation schedules might not always lead to the optimal solution Approximation ratio of COSS-CR Polynomial-time algorithm with constant approximation ratio ( 64 ) 1 The need for coordination Fully decentralized schedulers can perform arbitrarily worse than the optimal 1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014

60 The Coflow API 1. NO changes to user jobs 2. NO storage management register b register(broadcast, numflows) s register(shuffle, numflows, {b}) put shuffle id b.put(content, size) get unregister broadcast mappers driver (JobTracker) b.unregister() b.get(id) id s1 s.put(content, s.get(id sl )

61 Varys Employs a two-step algorithm to support coflow deadlines 1. Admission control 2. Allocation algorithm Do not admit any coflows that cannot be completed within deadline without violating existing deadlines Allocate minimum required resources to each coflow to finish them at their deadlines

62 More Predictable Facebook Trace Simulation EC2 Deployment % of Coflows % of Coflows Varys Fair EDF (Earliest-Deadline First) 1 0 Varys Fair Met Deadline Not Admitted Missed Deadline 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012

63 Optimizing Communication # Comm. Params * Spark-v Performance: Systems Approach Let users figure it out Hadoop-v1.2.1 YARN-v * Lower bound. Does not include many parameters that can indirectly impact communication; e.g., number of reducers etc. Also excludes control-plane communication/rpc parameters.

64 Experimental Methodology Varys deployment in EC2 100 m2.4xlarge machines Each machine has 8 CPU cores, 68.4 GB memory, and 1 Gbps NIC ~900 Mbps/machine during all-to-all communication Trace-driven simulation Detailed replay of a day-long Facebook trace (circa October 2010) 000-machine,150-rack cluster with 10:1 oversubscription

Coflow! A Networking Abstraction For Cluster Applications! Mosharaf Chowdhury Ion Stoica! UC#Berkeley#

Coflow! A Networking Abstraction For Cluster Applications! Mosharaf Chowdhury Ion Stoica! UC#Berkeley# Coflow A Networking Abstraction For Cluster Applications Mosharaf Chowdhury Ion Stoica UC#Berkeley# Cluster Applications Multi-Stage Data Flows» Computation interleaved with communication Computation»

More information

Decentralized Task-Aware Scheduling for Data Center Networks

Decentralized Task-Aware Scheduling for Data Center Networks Decentralized Task-Aware Scheduling for Data Center Networks Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, Ant Rowstron Presented by Eric Dong (yd2dong) October 30, 2015 Tasks in data centers Applications

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Advanced Computer Networks. Scheduling

Advanced Computer Networks. Scheduling Oriana Riva, Department of Computer Science ETH Zürich Advanced Computer Networks 263-3501-00 Scheduling Patrick Stuedi, Qin Yin and Timothy Roscoe Spring Semester 2015 Outline Last time Load balancing

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume

More information

Decentralized Task-Aware Scheduling for Data Center Networks

Decentralized Task-Aware Scheduling for Data Center Networks Decentralized Task-Aware Scheduling for Data Center Networks Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, and Ant Rowstron Microsoft Research Abstract Most data center applications perform rich

More information

Decentralized Task-Aware Scheduling for Data Center Networks

Decentralized Task-Aware Scheduling for Data Center Networks Decentralized Task-Aware Scheduling for Data Center Networks Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron Microsoft Research {fdogar, thomkar, hiballan, antr}@microsoft.com ABSTRACT

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Joint Optimization of Overlapping Phases in MapReduce

Joint Optimization of Overlapping Phases in MapReduce Joint Optimization of Overlapping Phases in MapReduce Minghong Lin, Li Zhang, Adam Wierman, Jian Tan Abstract MapReduce is a scalable parallel computing framework for big data processing. It exhibits multiple

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an

More information

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

ICS 143 - Principles of Operating Systems

ICS 143 - Principles of Operating Systems ICS 143 - Principles of Operating Systems Lecture 5 - CPU Scheduling Prof. Nalini Venkatasubramanian nalini@ics.uci.edu Note that some slides are adapted from course text slides 2008 Silberschatz. Some

More information

Big Data and Scripting Systems beyond Hadoop

Big Data and Scripting Systems beyond Hadoop Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid

More information

Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale

Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale Xiaoqi Ren, Ganesh Ananthanarayanan, Adam Wierman, Minlan Yu California Institute of Technology, Microsoft Research, University of Southern

More information

Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications

Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications 1 Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications Xuanhua Shi 1, Ming Chen 1, Ligang He 2,XuXie 1,LuLu 1, Hai Jin 1, Yong Chen 3, and Song Wu 1 1 SCTS/CGCL, School of Computer,

More information

Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications

Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications 1 Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications Xuanhua Shi 1, Ming Chen 1, Ligang He 2,XuXie 1,LuLu 1, Hai Jin 1, Yong Chen 3, and Song Wu 1 1 SCTS/CGCL, School of Computer,

More information

Matchmaking: A New MapReduce Scheduling Technique

Matchmaking: A New MapReduce Scheduling Technique Matchmaking: A New MapReduce Scheduling Technique Chen He Ying Lu David Swanson Department of Computer Science and Engineering University of Nebraska-Lincoln Lincoln, U.S. {che,ylu,dswanson}@cse.unl.edu

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy

MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy MapReduce Online Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears Neeraj Ganapathy Outline Hadoop Architecture Pipelined MapReduce Online Aggregation Continuous

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Noname manuscript No. (will be inserted by the editor) Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Aysan Rasooli Douglas G. Down Received: date / Accepted: date Abstract Hadoop

More information

INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling

INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling INFO5011 Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling COMMONWEALTH OF Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of the

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.

More information

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago Globus Striped GridFTP Framework and Server Raj Kettimuthu, ANL and U. Chicago Outline Introduction Features Motivation Architecture Globus XIO Experimental Results 3 August 2005 The Ohio State University

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Lecture Outline Overview of real-time scheduling algorithms Outline relative strengths, weaknesses

Lecture Outline Overview of real-time scheduling algorithms Outline relative strengths, weaknesses Overview of Real-Time Scheduling Embedded Real-Time Software Lecture 3 Lecture Outline Overview of real-time scheduling algorithms Clock-driven Weighted round-robin Priority-driven Dynamic vs. static Deadline

More information

Managing Data Transfers in Computer Clusters with Orchestra

Managing Data Transfers in Computer Clusters with Orchestra Managing Data Transfers in Computer Clusters with Orchestra Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica University of California, Berkeley {mosharaf, matei, jtma, jordan,

More information

Load Balancing in Data Center Networks

Load Balancing in Data Center Networks Load Balancing in Data Center Networks Henry Xu Computer Science City University of Hong Kong HKUST, March 2, 2015 Background Aggregator Aggregator Aggregator Worker Worker Worker Worker Low latency for

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Large-Scale Data Processing

Large-Scale Data Processing Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase

More information

Load Balancing in Data Center Networks

Load Balancing in Data Center Networks Load Balancing in Data Center Networks Henry Xu Department of Computer Science City University of Hong Kong ShanghaiTech Symposium on Data Scinece June 23, 2015 Today s plan Overview of research in DCN

More information

Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters

Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters Chris X. Cai, Shayan Saeed, Indranil Gupta, Roy H. Campbell, Franck Le Department of Computer Science University

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Cloud Computing Trends

Cloud Computing Trends UT DALLAS Erik Jonsson School of Engineering & Computer Science Cloud Computing Trends What is cloud computing? Cloud computing refers to the apps and services delivered over the internet. Software delivered

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju Chapter 7: Distributed Systems: Warehouse-Scale Computing Fall 2011 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note:

More information

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012 Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics data 4

More information

Residual Traffic Based Task Scheduling in Hadoop

Residual Traffic Based Task Scheduling in Hadoop Residual Traffic Based Task Scheduling in Hadoop Daichi Tanaka University of Tsukuba Graduate School of Library, Information and Media Studies Tsukuba, Japan e-mail: s1421593@u.tsukuba.ac.jp Masatoshi

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Fair Scheduler. Table of contents

Fair Scheduler. Table of contents Table of contents 1 Purpose... 2 2 Introduction... 2 3 Installation... 3 4 Configuration...3 4.1 Scheduler Parameters in mapred-site.xml...4 4.2 Allocation File (fair-scheduler.xml)... 6 4.3 Access Control

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

San Diego Supercomputer Center, UCSD. Institute for Digital Research and Education, UCLA

San Diego Supercomputer Center, UCSD. Institute for Digital Research and Education, UCLA Facilitate Parallel Computation Using Kepler Workflow System on Virtual Resources Jianwu Wang 1, Prakashan Korambath 2, Ilkay Altintas 1 1 San Diego Supercomputer Center, UCSD 2 Institute for Digital Research

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Ali Ghodsi Head of PM and Engineering Databricks

Ali Ghodsi Head of PM and Engineering Databricks Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data

More information

Adaptive Task Scheduling for MultiJob MapReduce Environments

Adaptive Task Scheduling for MultiJob MapReduce Environments Adaptive Task Scheduling for MultiJob MapReduce Environments Jordà Polo, David de Nadal, David Carrera, Yolanda Becerra, Vicenç Beltran, Jordi Torres and Eduard Ayguadé Barcelona Supercomputing Center

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

PIKACHU: How to Rebalance Load in Optimizing MapReduce On Heterogeneous Clusters

PIKACHU: How to Rebalance Load in Optimizing MapReduce On Heterogeneous Clusters PIKACHU: How to Rebalance Load in Optimizing MapReduce On Heterogeneous Clusters Rohan Gandhi, Di Xie, Y. Charlie Hu Purdue University Abstract For power, cost, and pricing reasons, datacenters are evolving

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Towards Predictable Datacenter Networks

Towards Predictable Datacenter Networks Towards Predictable Datacenter Networks Hitesh Ballani, Paolo Costa, Thomas Karagiannis and Ant Rowstron Microsoft Research, Cambridge This talk is about Guaranteeing network performance for tenants in

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

OPERATING SYSTEMS SCHEDULING

OPERATING SYSTEMS SCHEDULING OPERATING SYSTEMS SCHEDULING Jerry Breecher 5: CPU- 1 CPU What Is In This Chapter? This chapter is about how to get a process attached to a processor. It centers around efficient algorithms that perform

More information

Hedera: Dynamic Flow Scheduling for Data Center Networks

Hedera: Dynamic Flow Scheduling for Data Center Networks Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan * Nelson Huang Amin Vahdat UC San Diego * Williams College - USENIX NSDI 2010 - Motivation!"#$%&'($)*

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Beyond the Stars: Revisiting Virtual Cluster Embeddings

Beyond the Stars: Revisiting Virtual Cluster Embeddings Beyond the Stars: Revisiting Virtual Cluster Embeddings Matthias Rost Technische Universität Berlin September 7th, 2015, Télécom-ParisTech Joint work with Carlo Fuerst, Stefan Schmid Published in ACM SIGCOMM

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden

More information

Breaking the MapReduce Stage Barrier

Breaking the MapReduce Stage Barrier Cluster Computing manuscript No. (will be inserted by the editor) Breaking the MapReduce Stage Barrier Abhishek Verma Brian Cho Nicolas Zea Indranil Gupta Roy H. Campbell Received: date / Accepted: date

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Paolo Costa costa@imperial.ac.uk

Paolo Costa costa@imperial.ac.uk joint work with Ant Rowstron, Austin Donnelly, and Greg O Shea (MSR Cambridge) Hussam Abu-Libdeh, Simon Schubert (Interns) Paolo Costa costa@imperial.ac.uk Paolo Costa CamCube - Rethinking the Data Center

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

Reliable and Locality-driven scheduling in Hadoop

Reliable and Locality-driven scheduling in Hadoop Reliable and Locality-driven scheduling in Hadoop Phuong Tran Anh Institute Superior Tecnico Abstract. The increasing use of computing resources in our daily lives leads to data being generated at an astonishing

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Importance of Data locality

Importance of Data locality Importance of Data Locality - Gerald Abstract Scheduling Policies Test Applications Evaluation metrics Tests in Hadoop Test environment Tests Observations Job run time vs. Mmax Job run time vs. number

More information

Big Data Processing. Patrick Wendell Databricks

Big Data Processing. Patrick Wendell Databricks Big Data Processing Patrick Wendell Databricks About me Committer and PMC member of Apache Spark Former PhD student at Berkeley Left Berkeley to help found Databricks Now managing open source work at Databricks

More information

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München,

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information