Big Data Technology CS 236620, Technion, Spring 2014 System Design Principles Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa
Data = Systems. We need to move, store, and process data.
Big Data = Big Systems
How to Get Big Systems Right? A multidisciplinary science in its own right: distributed computing, networking, hardware and software architecture, operations research, measurement, performance evaluation, power management, and even civil engineering. In this course we cover the aspects related to Computer Science. We'll start with some principles and see how they manifest in real systems.
An Ideal System Should 1. Scale
Keeping up With the Growth
Partitioning = Parallelism = Scalability
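A minimal sketch of the idea, with an in-process worker pool standing in for separate machines (an illustrative assumption, not a real cluster): hash-partition the records by key, and each partition can then be processed independently and in parallel.

```python
from concurrent.futures import ProcessPoolExecutor

def partition(records, num_parts):
    # Hash-partition records by key; each part could live on a different node.
    parts = [[] for _ in range(num_parts)]
    for key, value in records:
        parts[hash(key) % num_parts].append((key, value))
    return parts

def process(part):
    # Per-partition work (here: summing values) needs no cross-partition coordination.
    return sum(v for _, v in part)

if __name__ == "__main__":
    records = [("user%d" % i, i) for i in range(1000)]
    with ProcessPoolExecutor() as pool:
        # Four partitions -> four independent workers; adding partitions adds parallelism.
        total = sum(pool.map(process, partition(records, 4)))
    print(total)
```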
Architect's Dream: Throughput. How many requests can be served in a unit of time?
Architect's Dream: Latency. How long does a single request take?
Scaling Up? Scaling Out? Scale up: add resources to a single machine. Scale out: add more machines.
Example: Network Filesystems. Monolithic (e.g., historical NFS): a single NFS server exports paths such as server:/a/b/z.txt and handles every R/W request. Distributed (e.g., Hadoop FS): a metadata service (namenode) maps a path such as /users/bob/courses/cs101.txt to a block location such as <server_123, block 20>, and R/W requests then go directly to the data service (datanode).
Scale-Out Philosophy. Scalability through decoupling: whatever is split can be scaled independently. HDFS: metadata and data accesses are decoupled. Minimize centralized processing: metadata accesses are coordinated but lean. Maximize I/O parallelism: clients access the data nodes concurrently.
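A toy sketch of this decoupled read path (not the real HDFS client API; the class and method names here are illustrative assumptions): one lean metadata lookup resolves a path to block locations, then the data blocks are fetched from datanodes in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

class MetadataService:
    """Toy 'namenode': maps paths to (datanode, block_id) pairs; holds no data."""
    def __init__(self, mapping):
        self.mapping = mapping
    def locate(self, path):
        return self.mapping[path]          # one lean, coordinated metadata access

class DataNode:
    """Toy 'datanode': serves raw block bytes; accessed directly by clients."""
    def __init__(self, blocks):
        self.blocks = blocks
    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(meta, datanodes, path):
    locations = meta.locate(path)                       # metadata step
    with ThreadPoolExecutor() as pool:                  # data reads proceed in parallel
        parts = pool.map(lambda loc: datanodes[loc[0]].read_block(loc[1]),
                         locations)
        return b"".join(parts)

datanodes = {"server_123": DataNode({20: b"hello ", 21: b"world"})}
meta = MetadataService({"/users/bob/courses/cs101.txt":
                        [("server_123", 20), ("server_123", 21)]})
print(read_file(meta, datanodes, "/users/bob/courses/cs101.txt"))  # b'hello world'
```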
The Peer-to-Peer Approach. Completely server-less: all nodes and functions are fully symmetric. E.g., in a distributed data store every node has a serving function and a management function. Less favored in managed DC environments: very hard to maintain consistency guarantees, very hard to optimize globally. Lightweight centralized critical services prevail.
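One common mechanism behind fully symmetric nodes is consistent hashing: every node owns a segment of a hash ring, so any node can route a key without a central coordinator. A minimal sketch; the ring placement and hash function below are illustrative assumptions, not a specific system's design.

```python
import bisect, hashlib

def _h(s):
    # Map a string to a point on the ring [0, 2^32).
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << 32)

class Ring:
    def __init__(self, nodes):
        # Each node is placed at the hash of its name; no central directory needed.
        self._points = sorted((_h(n), n) for n in nodes)

    def owner(self, key):
        # The first node clockwise from the key's hash owns the key.
        positions = [p for p, _ in self._points]
        i = bisect.bisect(positions, _h(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("users/bob/photo-17"))   # any node can compute this locally
```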
An Ideal System Should 2. Be Resilient
Protecting the Critical Services
Resilience = Redundancy
The Tail at Scale. Problems are aggravated in large systems: component-level variability is amplified by scale. Failures and slow components are part of normal life, not an exception. Two ways of addressing service variability: prevent bad things from happening by detecting and isolating the slow/flawed components, or contain bad things through redundancy (hedged/tied requests, speculative task execution).
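A minimal sketch of a hedged request, assuming an illustrative fetch(replica, key) call standing in for a network RPC: send the read to one replica, and if it has not answered within a short delay, duplicate it to a second replica and return whichever response arrives first.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def hedged_read(replicas, key, fetch, hedge_after=0.01):
    """Send the read to replicas[0]; if no answer within hedge_after seconds,
    duplicate it to replicas[1] and return whichever response arrives first."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(fetch, replicas[0], key)]
    done, _ = wait(futures, timeout=hedge_after)
    if not done:                              # primary is slow: hedge the request
        futures.append(pool.submit(fetch, replicas[1], key))
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)                 # a real system would also cancel the straggler
    return result

def fetch(replica, key):                      # illustrative stand-in for an RPC
    time.sleep(1.0 if replica == "slow-node" else 0.001)
    return f"{key}@{replica}"

print(hedged_read(["slow-node", "fast-node"], "user42", fetch))  # user42@fast-node
```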
Redundancy Means Synchronization
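Replicas must agree on the data they hold, which forces coordination on the write path. One standard pattern, shown here as an in-memory sketch rather than any specific system's protocol, is quorum replication: a write succeeds only once a majority of replicas acknowledge it, and a read consults a majority so it is guaranteed to overlap the latest write.

```python
class Replica:
    """Illustrative in-memory replica; a real one lives on another machine."""
    def __init__(self):
        self.data = {}
    def store(self, key, versioned_value):
        self.data[key] = versioned_value
        return True
    def load(self, key):
        return self.data.get(key, (0, None))

def quorum_write(replicas, key, version, value):
    # Acknowledge the write only once a majority of replicas has stored it.
    acks = sum(1 for r in replicas if r.store(key, (version, value)))
    return acks > len(replicas) // 2

def quorum_read(replicas, key):
    # A majority read overlaps every majority write; take the newest version seen.
    majority = replicas[: len(replicas) // 2 + 1]
    return max((r.load(key) for r in majority), key=lambda v: v[0])

nodes = [Replica() for _ in range(3)]
quorum_write(nodes, "x", version=1, value="hello")
print(quorum_read(nodes, "x"))   # (1, 'hello')
```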
An Ideal System Should 3. Be designed for the right goal
Expected Workload Matters. Latency-oriented: interactive, user-facing systems; example: Web search serving. Throughput-oriented: back-end heavyweights; example: Web search indexing.
Data Accessibility Matters: Stream vs. Warehouse.
Access Patterns Matter. Data Analytics: throughput-oriented applications; write-once (typically, append), read-many (typically, large sequential reads). Online Transaction Processing (OLTP): latency-oriented applications; write-intensive, typically many small direct accesses. Huge gray area in between.
Hardware Constraints Matter http://www.ospmag.com/issue/article/time-is-not-always-on-our-side
Compute- or Data-Intensive? Compute vs. storage demands.
Locality Matters. Can computation and storage be aligned? (Optimization?) How repetitive is the workload? (Optimization?) Power-law distribution: Pr(x > X) ~ X^(-α); a few dominant items, followed by a long tail.
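A quick illustration of why the skew matters for locality: under a Zipf-like (power-law) access distribution, caching only the dominant items already captures most of the traffic. A sketch with an assumed exponent of α = 1, purely for illustration.

```python
def zipf_hit_rate(num_items, cache_size, alpha=1.0):
    # Probability mass of the item with rank i is proportional to 1 / i**alpha.
    weights = [1.0 / (i ** alpha) for i in range(1, num_items + 1)]
    total = sum(weights)
    # Fraction of requests served by caching the top `cache_size` items.
    return sum(weights[:cache_size]) / total

# Caching 1% of a million items serves roughly 68% of requests under these assumptions.
print(round(zipf_hit_rate(1_000_000, 10_000), 2))
```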
Consistency Matters. Stricter properties = stronger consistency. Are you prepared to handle weird stuff? Imagine stock alerts: is it okay to lose an event once in a while? Imagine a social network: Bob deletes photos with his ex-date, Alice; then Bob befriends Carol. Can Carol observe these events in the reverse order?
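A toy illustration of the anomaly, assuming a hypothetical replica that applies updates in whatever order they arrive: if the "delete photos" and "add friend" updates reach Carol's replica in the reverse order, there is a window in which Carol is already a friend but the photos are not yet deleted.

```python
class ProfileReplica:
    """Toy replica with no ordering guarantees between updates."""
    def __init__(self):
        self.photos = {"bob_and_alice.jpg"}
        self.friends = set()

    def apply(self, update):
        kind, arg = update
        if kind == "delete_photo":
            self.photos.discard(arg)
        elif kind == "add_friend":
            self.friends.add(arg)

# Bob issues: (1) delete the photo, (2) befriend Carol.
updates = [("delete_photo", "bob_and_alice.jpg"), ("add_friend", "carol")]

replica = ProfileReplica()
replica.apply(updates[1])          # updates arrive in the reverse order
# At this point Carol is a friend and can still see the photo.
print("carol" in replica.friends and "bob_and_alice.jpg" in replica.photos)  # True
replica.apply(updates[0])          # the deletion arrives later
```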
A Dialogue in the Wild. Engineer: we're afraid of any kind of synchronization. Scientist: what kind of guarantee do you want to get? Engineer: let's build something simple. Relax your consistency models; we want the systems to be eventually consistent. Scientist: this is an interesting problem. Are you really sure this is what you want to get?
Example: Amazon's Outage. Weak consistency models can lead to data loss.
Services Over the Network
Elasticity Matters. Resource demands are often unknown in advance, driven by application popularity. Goal: enable organic growth; add (and pay) as you grow. Economies of scale: pool multiple datasets and services in huge DCs for better use of shared resources (personnel, real estate, electricity, network, compute, and storage).
Cloud Computing. Computing resources delivered over a network; infrastructure issues abstracted away. *-as-a-service: SaaS, PaaS, IaaS, ...
A Word on Data Center Management
Designing the Air Flows Source: 42u Consulting
Power Efficiency - Surprising Facts At Facebook's Prineville, OR, facility, ambient air flows into the building, passing first through a series of filters to remove bugs, dust, and other contaminants. Previous estimates suggested that electricity consumption in massive server farms would double between 2005 and 2010. Instead, the number rose by 56% worldwide, and merely 36% in the US. The most efficient data centers now hover at temperatures closer to 80 degrees Fahrenheit, and instead of sweaters, the technicians walk around in shorts.
Summary. Design for scale. Design for fault tolerance. Know what you design for. Be aware of the environment.
Further Reading: Lessons of Scale at Facebook; Redesigning the Data Center (CACM).