Big Data Technology CS 236620, Technion, Spring 2014 System Design Principles Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa
Data = Systems. We need to move, store, and process data.
Big Data = Big Systems
How to Get Big Systems Right? A multidisciplinary science in its own right: distributed computing, networking, hardware and software architecture, operations research, measurement, performance evaluation, power management, and even civil engineering. In this course we cover the aspects related to Computer Science. We'll start with some principles and see how they manifest in real systems.
An Ideal System Should 1. Scale
Keeping up With the Growth
Partitioning = Parallelism = Scalability
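A minimal sketch of the idea, with an in-process worker pool standing in for separate machines (an illustrative assumption, not a real cluster): hash-partition the records by key, and each partition can then be processed independently and in parallel.

```python
from concurrent.futures import ProcessPoolExecutor

def partition(records, num_parts):
    # Hash-partition records by key; each part could live on a different node.
    parts = [[] for _ in range(num_parts)]
    for key, value in records:
        parts[hash(key) % num_parts].append((key, value))
    return parts

def process(part):
    # Per-partition work (here: summing values) needs no cross-partition coordination.
    return sum(v for _, v in part)

if __name__ == "__main__":
    records = [("user%d" % i, i) for i in range(1000)]
    with ProcessPoolExecutor() as pool:
        # Four partitions -> four independent workers; adding partitions adds parallelism.
        total = sum(pool.map(process, partition(records, 4)))
    print(total)
```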
Architect's Dream: Throughput. How many requests can be served in a unit of time?
Architect's Dream: Latency. How long does a single request take?
Scaling Up? Scaling Out? Scale up: add resources to a single machine. Scale out: add more machines.
Example: Network Filesystems. Monolithic (e.g., historical NFS): a single NFS server exports paths such as server:/a/b/z.txt and handles every R/W request. Distributed (e.g., Hadoop FS): a metadata service (namenode) maps a path such as /users/bob/courses/cs101.txt to a block location such as <server_123, block 20>, and R/W requests then go directly to the data service (datanode).
Scale-Out Philosophy. Scalability through decoupling: whatever is split can be scaled independently. HDFS: metadata and data accesses are decoupled. Minimize centralized processing: metadata accesses are coordinated but lean. Maximize I/O parallelism: clients access the data nodes concurrently.
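A toy sketch of this decoupled read path (not the real HDFS client API; the class and method names here are illustrative assumptions): one lean metadata lookup resolves a path to block locations, then the data blocks are fetched from datanodes in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

class MetadataService:
    """Toy 'namenode': maps paths to (datanode, block_id) pairs; holds no data."""
    def __init__(self, mapping):
        self.mapping = mapping
    def locate(self, path):
        return self.mapping[path]          # one lean, coordinated metadata access

class DataNode:
    """Toy 'datanode': serves raw block bytes; accessed directly by clients."""
    def __init__(self, blocks):
        self.blocks = blocks
    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(meta, datanodes, path):
    locations = meta.locate(path)                       # metadata step
    with ThreadPoolExecutor() as pool:                  # data reads proceed in parallel
        parts = pool.map(lambda loc: datanodes[loc[0]].read_block(loc[1]),
                         locations)
        return b"".join(parts)

datanodes = {"server_123": DataNode({20: b"hello ", 21: b"world"})}
meta = MetadataService({"/users/bob/courses/cs101.txt":
                        [("server_123", 20), ("server_123", 21)]})
print(read_file(meta, datanodes, "/users/bob/courses/cs101.txt"))  # b'hello world'
```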
The Peer-to-Peer Approach. Completely server-less: all nodes and functions are fully symmetric. E.g., in a distributed data store every node has a serving function and a management function. Less favored in managed DC environments: very hard to maintain consistency guarantees, very hard to optimize globally. Lightweight centralized critical services prevail.
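One common mechanism behind fully symmetric nodes is consistent hashing: every node owns a segment of a hash ring, so any node can route a key without a central coordinator. A minimal sketch; the ring placement and hash function below are illustrative assumptions, not a specific system's design.

```python
import bisect, hashlib

def _h(s):
    # Map a string to a point on the ring [0, 2^32).
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << 32)

class Ring:
    def __init__(self, nodes):
        # Each node is placed at the hash of its name; no central directory needed.
        self._points = sorted((_h(n), n) for n in nodes)

    def owner(self, key):
        # The first node clockwise from the key's hash owns the key.
        positions = [p for p, _ in self._points]
        i = bisect.bisect(positions, _h(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("users/bob/photo-17"))   # any node can compute this locally
```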
An Ideal System Should 2. Be Resilient
Protecting the Critical Services
Resilience = Redundancy
The Tail at Scale. Problems are aggravated in large systems: component-level variability is amplified by scale. Failures and slow components are part of normal life, not an exception. Two ways of addressing service variability: prevent bad things from happening by detecting and isolating the slow/flawed components, or contain bad things through redundancy (hedged/tied requests, speculative task execution).
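A minimal sketch of a hedged request, assuming an illustrative fetch(replica, key) call standing in for a network RPC: send the read to one replica, and if it has not answered within a short delay, duplicate it to a second replica and return whichever response arrives first.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def hedged_read(replicas, key, fetch, hedge_after=0.01):
    """Send the read to replicas[0]; if no answer within hedge_after seconds,
    duplicate it to replicas[1] and return whichever response arrives first."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(fetch, replicas[0], key)]
    done, _ = wait(futures, timeout=hedge_after)
    if not done:                              # primary is slow: hedge the request
        futures.append(pool.submit(fetch, replicas[1], key))
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)                 # a real system would also cancel the straggler
    return result

def fetch(replica, key):                      # illustrative stand-in for an RPC
    time.sleep(1.0 if replica == "slow-node" else 0.001)
    return f"{key}@{replica}"

print(hedged_read(["slow-node", "fast-node"], "user42", fetch))  # user42@fast-node
```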
Redundancy Means Synchronization
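Replicas must agree on the data they hold, which forces coordination on the write path. One standard pattern, shown here as an in-memory sketch rather than any specific system's protocol, is quorum replication: a write succeeds only once a majority of replicas acknowledge it, and a read consults a majority so it is guaranteed to overlap the latest write.

```python
class Replica:
    """Illustrative in-memory replica; a real one lives on another machine."""
    def __init__(self):
        self.data = {}
    def store(self, key, versioned_value):
        self.data[key] = versioned_value
        return True
    def load(self, key):
        return self.data.get(key, (0, None))

def quorum_write(replicas, key, version, value):
    # Acknowledge the write only once a majority of replicas has stored it.
    acks = sum(1 for r in replicas if r.store(key, (version, value)))
    return acks > len(replicas) // 2

def quorum_read(replicas, key):
    # A majority read overlaps every majority write; take the newest version seen.
    majority = replicas[: len(replicas) // 2 + 1]
    return max((r.load(key) for r in majority), key=lambda v: v[0])

nodes = [Replica() for _ in range(3)]
quorum_write(nodes, "x", version=1, value="hello")
print(quorum_read(nodes, "x"))   # (1, 'hello')
```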
An Ideal System Should 3. Be designed for the right goal
Expected Workload Matters. Latency-oriented: interactive, user-facing systems; example: Web search serving. Throughput-oriented: back-end heavyweights; example: Web search indexing.
Data Accessibility Matters: Stream vs. Warehouse.
Access Patterns Matter. Data Analytics: throughput-oriented applications; write-once (typically, append), read-many (typically, large sequential reads). Online Transaction Processing (OLTP): latency-oriented applications; write-intensive, typically many small direct accesses. Huge gray area in between.
Hardware Constraints Matter http://www.ospmag.com/issue/article/time-is-not-always-on-our-side
Compute- or Data-Intensive? Compute vs. storage demands.
Locality Matters. Can computation and storage be aligned? (Optimization?) How repetitive is the workload? (Optimization?) Power-law distribution: Pr(x > X) ~ X^(-α); a few dominant items, followed by a long tail.
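A quick illustration of why the skew matters for locality: under a Zipf-like (power-law) access distribution, caching only the dominant items already captures most of the traffic. A sketch with an assumed exponent of α = 1, purely for illustration.

```python
def zipf_hit_rate(num_items, cache_size, alpha=1.0):
    # Probability mass of the item with rank i is proportional to 1 / i**alpha.
    weights = [1.0 / (i ** alpha) for i in range(1, num_items + 1)]
    total = sum(weights)
    # Fraction of requests served by caching the top `cache_size` items.
    return sum(weights[:cache_size]) / total

# Caching 1% of a million items serves roughly 68% of requests under these assumptions.
print(round(zipf_hit_rate(1_000_000, 10_000), 2))
```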
Consistency Matters. Stricter properties = stronger consistency. Are you prepared to handle weird stuff? Imagine stock alerts: is it okay to lose an event once in a while? Imagine a social network: Bob deletes photos with his ex-date, Alice; then Bob befriends Carol. Can Carol observe these events in the reverse order?
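A toy illustration of the anomaly, assuming a hypothetical replica that applies updates in whatever order they arrive: if the "delete photos" and "add friend" updates reach Carol's replica in the reverse order, there is a window in which Carol is already a friend but the photos are not yet deleted.

```python
class ProfileReplica:
    """Toy replica with no ordering guarantees between updates."""
    def __init__(self):
        self.photos = {"bob_and_alice.jpg"}
        self.friends = set()

    def apply(self, update):
        kind, arg = update
        if kind == "delete_photo":
            self.photos.discard(arg)
        elif kind == "add_friend":
            self.friends.add(arg)

# Bob issues: (1) delete the photo, (2) befriend Carol.
updates = [("delete_photo", "bob_and_alice.jpg"), ("add_friend", "carol")]

replica = ProfileReplica()
replica.apply(updates[1])          # updates arrive in the reverse order
# At this point Carol is a friend and can still see the photo.
print("carol" in replica.friends and "bob_and_alice.jpg" in replica.photos)  # True
replica.apply(updates[0])          # the deletion arrives later
```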
A Dialogue in the Wild. Engineer: we're afraid of any kind of synchronization. Scientist: what kind of guarantee do you want to get? Engineer: let's build something simple. Relax your consistency models; we want the systems to be eventually consistent. Scientist: this is an interesting problem. Are you really sure this is what you want to get?
Example: Amazon's Outage. Weak consistency models can lead to data loss.
Services Over the Network
Elasticity Matters. Resource demands are often unknown in advance, driven by application popularity. Goal: enable organic growth; add (and pay) as you grow. Economies of scale: pool multiple datasets and services in huge DCs for better use of shared resources (personnel, real estate, electricity, network, compute, and storage).
Cloud Computing. Computing resources delivered over a network; infrastructure issues abstracted away. *-as-a-service: SaaS, PaaS, IaaS, ...
A Word on Data Center Management
Designing the Air Flows Source: 42u Consulting
Power Efficiency - Surprising Facts At Facebook's Prineville, OR, facility, ambient air flows into the building, passing first through a series of filters to remove bugs, dust, and other contaminants. Previous estimates suggested that electricity consumption in massive server farms would double between 2005 and 2010. Instead, the number rose by 56% worldwide, and merely 36% in the US. The most efficient data centers now hover at temperatures closer to 80 degrees Fahrenheit, and instead of sweaters, the technicians walk around in shorts.
Summary. Design for scale. Design for fault tolerance. Know what you design for. Be aware of the environment.
Further Reading: Lessons of Scale at Facebook; Redesigning the Data Center (CACM).