Big Data Technology CS 236620, Technion, Spring 2014 System Design Principles Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa

Data = Systems We need to Move, Store and Process data

Big Data = Big Systems

How to Get the Big Systems Right? A multidisciplinary science in its own right: Distributed Computing, Networking, Hardware and Software Architecture, Operations Research, Measurement, Performance Evaluation, Power Management, and even Civil Engineering. In this course - aspects related to Computer Science. We'll start with some principles and see how they manifest in real systems.

An Ideal System Should 1. Scale

Keeping up With the Growth

Partitioning = Parallelism = Scalability
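To make the equality concrete, here is a minimal hash-partitioning sketch; the class name, keys, and shard count are illustrative, not from the course:

```java
public class Partitioner {
    // Route each key to one of numShards shards. Disjoint shards can be
    // served by different nodes in parallel -- partitioning is what turns
    // added machines into added throughput.
    static int shardFor(String key, int numShards) {
        // floorMod keeps the index non-negative even when hashCode() < 0
        return Math.floorMod(key.hashCode(), numShards);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"alice", "bob", "carol", "dave"}) {
            System.out.println(key + " -> shard " + shardFor(key, 4));
        }
    }
}
```

One caveat of plain hash-mod-N: changing N remaps almost every key, which is why systems that resize frequently tend to prefer consistent hashing.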

Architect's Dream - Throughput How many requests can be served in a unit of time?

Architect's Dream - Latency How long does a single request take?

Scaling Up? Scaling Out? Scale up vs. scale out

Example: Network Filesystems. Monolithic (e.g., historical NFS): a single NFS server exports server:/a/b/z.txt. Distributed (e.g., Hadoop FS): a metadata service (namenode) resolves /users/bob/courses/cs101.txt to <server_123, block 20>, and the R/W request goes to the data service (datanode).

Scale-Out Philosophy. Scalability through decoupling: whatever is split can be scaled independently. HDFS: metadata and data accesses decoupled. Minimize centralized processing: metadata accesses are coordinated but lean. Maximize I/O parallelism: clients access the data nodes concurrently.
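The decoupling is visible from the client side. Here is a read through the standard Hadoop Java API (the namenode address and file path below are placeholders): open() costs one lean metadata RPC to the namenode, after which the bytes stream directly from the datanodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Namenode address is a placeholder; substitute your cluster's.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        // open(): one lean metadata RPC -- the namenode returns block locations.
        try (FSDataInputStream in = fs.open(new Path("/users/bob/courses/cs101.txt"))) {
            // The bulk data then streams straight from the datanodes,
            // without passing through the namenode at all.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```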

The Peer-to-Peer Approach. Completely server-less: all nodes and functions are fully symmetric. E.g., in a distributed data store every node has a serving function and a management function. Less favored in managed DC environments: very hard to maintain consistency guarantees, very hard to optimize globally. Lightweight centralized critical services prevail.

An Ideal System Should 2. Be Resilient

Protecting the Critical Services

Resilience = Redundancy

The Tail at Scale. Problems are aggravated in large systems: component-level variability is amplified by scale, and failures and slow components are part of normal life, not an exception. Two ways of addressing service variability: prevent bad things from happening by detecting and isolating the slow/flawed components, or contain bad things through redundancy (hedged/tied requests, speculative task execution).
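As a sketch of the containment side, here is a hedged request in miniature; replica names, latencies, and the 50 ms hedge budget are all invented for illustration. The client gives the primary a short budget, then fires a backup copy and takes whichever answer arrives first.

```java
import java.util.concurrent.*;

public class HedgedRead {
    // Stand-in for a real RPC to a replica; latency is simulated.
    static String queryReplica(String replica, long latencyMs) throws InterruptedException {
        Thread.sleep(latencyMs);
        return "value from " + replica;
    }

    static String hedgedGet(ExecutorService pool, long hedgeAfterMs) throws Exception {
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        cs.submit(() -> queryReplica("primary", 500));   // a straggler today
        Future<String> fast = cs.poll(hedgeAfterMs, TimeUnit.MILLISECONDS);
        if (fast != null) return fast.get();             // primary met the budget
        cs.submit(() -> queryReplica("backup", 30));     // hedge: fire a second copy
        return cs.take().get();                          // first finisher wins
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Hedging after 50 ms cuts this request from ~500 ms to ~80 ms.
        System.out.println(hedgedGet(pool, 50));
        pool.shutdownNow();  // abandon the slow outstanding request
    }
}
```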

Redundancy Means Synchronization

An Ideal System Should 3. Be designed for the right goal

Expected Workload Matters. Latency-oriented: interactive, user-facing systems; example: Web search serving. Throughput-oriented: back-end heavyweights; example: Web search indexing.

Data Accessibility Matters: Stream vs. Warehouse

Access Patterns Matter. Data Analytics: throughput-oriented applications; write-once (typically, append), read-many (typically, large sequential reads). Online Transaction Processing (OLTP): latency-oriented applications; write-intensive, typically many small direct accesses. Huge gray area in between.
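The two patterns can be felt even on a single machine. A rough micro-benchmark sketch (file size and request counts are arbitrary): one pass scans a scratch file sequentially, analytics-style; the other issues many small reads at random offsets, OLTP-style.

```java
import java.io.*;
import java.util.Random;

public class AccessPatterns {
    public static void main(String[] args) throws IOException {
        // Create a 64 MB scratch file to probe both patterns.
        File f = File.createTempFile("access", ".bin");
        f.deleteOnExit();
        byte[] chunk = new byte[1 << 20];
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(f))) {
            for (int i = 0; i < 64; i++) out.write(chunk);
        }

        // Analytics-style: one large sequential scan.
        long t0 = System.nanoTime();
        try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
            while (in.read(chunk) != -1) { /* consume */ }
        }
        System.out.printf("sequential scan: %d ms%n", (System.nanoTime() - t0) / 1_000_000);

        // OLTP-style: many small reads at random offsets.
        Random rnd = new Random(42);
        byte[] rec = new byte[256];
        t0 = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            for (int i = 0; i < 100_000; i++) {
                raf.seek((long) (rnd.nextDouble() * (f.length() - rec.length)));
                raf.readFully(rec);
            }
        }
        System.out.printf("random reads:    %d ms%n", (System.nanoTime() - t0) / 1_000_000);
    }
}
```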

Hardware Constraints Matter http://www.ospmag.com/issue/article/time-is-not-always-on-our-side

Compute- or Data-Intensive? Compute vs. storage

Locality Matters. Can computation and storage be aligned, and what optimization does that enable? How repetitive is the workload, and what optimization does that enable? Power-law distribution: a few dominant items, Pr(x > X) ~ X^(-α), and a long tail.
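A small worked example of why the power law rewards locality-aware caching; the catalog size and exponent here are assumptions. With item popularity proportional to rank^(-α), the program computes how much traffic the dominant items absorb.

```java
public class ZipfHeadMass {
    public static void main(String[] args) {
        int n = 1_000_000;    // catalog size (assumed)
        double alpha = 1.0;   // power-law exponent (assumed)

        // Weight of rank r is r^(-alpha); normalize over the whole catalog.
        double total = 0;
        for (int r = 1; r <= n; r++) total += Math.pow(r, -alpha);

        // Mass captured by caching only the top 1% of items.
        int k = n / 100;
        double head = 0;
        for (int r = 1; r <= k; r++) head += Math.pow(r, -alpha);

        // Prints roughly 68%: the dominant items dwarf the long tail.
        System.out.printf("top 1%% of items draw %.1f%% of requests%n",
                100 * head / total);
    }
}
```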

Consistency Matters. Stricter properties = stronger consistency. Are you prepared to handle weird stuff? Fancy stock alerts: is it okay to lose an event once in a while? Fancy a social network: Bob deletes photos with his ex-date Alice, then Bob befriends Carol. Can Carol observe these events in reverse order?
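A toy model of this anomaly (events and delays invented for illustration): if Bob's two updates reach Carol's replica over paths with different delays, and the replica simply applies them in arrival order, Carol observes them reversed.

```java
import java.util.*;

public class ReorderDemo {
    record Update(String event, int issuedAtMs, int networkDelayMs) {
        int arrivesAtMs() { return issuedAtMs + networkDelayMs; }
    }

    public static void main(String[] args) {
        // Bob's causal order: delete first, befriend second.
        List<Update> issued = List.of(
            new Update("Bob deletes photos with Alice", 0, 300),  // slow path
            new Update("Bob befriends Carol",          10, 20));  // fast path

        // An eventually consistent replica with no ordering guarantee
        // applies updates as they arrive.
        List<Update> arrival = new ArrayList<>(issued);
        arrival.sort(Comparator.comparingInt(Update::arrivesAtMs));

        System.out.println("Carol's replica observes:");
        for (Update u : arrival)
            System.out.printf("  t=%dms  %s%n", u.arrivesAtMs(), u.event);
        // The befriend lands at t=30, the delete at t=300: in between,
        // Carol can see the photos Bob already deleted.
    }
}
```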

A Dialogue in the Wild. Engineer: we're afraid of any kind of synchronization. Scientist: what kind of guarantee do you want to get? Engineer: let's build something simple. Relax your consistency models. We want the systems to be eventually consistent. Scientist: this is an interesting problem. Are you really sure this is what you want to get?

Example: Amazon's Outage. Weak consistency models can lead to data loss.

Services Over the Network

Elasticity Matters. Resource demands are often unknown in advance, driven by application popularity. Goal: enable organic growth; add- (and pay-) as-you-grow. Economies of scale: pool multiple datasets and services in huge DCs for better use of shared resources (personnel, real estate, electricity, network, compute and storage).

Cloud Computing. Computing resources delivered over a network; infrastructure issues abstracted away. ***-as-a-service: SaaS, PaaS, IaaS, ...

A Word on Data Center Management

Designing the Air Flows Source: 42u Consulting

Power Efficiency - Surprising Facts At Facebook's Prineville, OR, facility, ambient air flows into the building, passing first through a series of filters to remove bugs, dust, and other contaminants. Previous estimates suggested that electricity consumption in massive server farms would double between 2005 and 2010. Instead, the number rose by 56% worldwide, and merely 36% in the US. The most efficient data centers now hover at temperatures closer to 80 degrees Fahrenheit, and instead of sweaters, the technicians walk around in shorts.

Summary Design for scale Design for fault-tolerance Know what you design for Be aware of the environment

Further Reading: Lessons of Scale at Facebook; Redesigning the Data Center (CACM).