Architecting Distributed Databases for Failure A Case Study with Druid

Similar documents

Network Virtualization Platform (NVP) Incident Reports

Application Performance Management for Enterprise Applications

Object Storage: A Growing Opportunity for Service Providers. White Paper. Prepared for: 2012 Neovise, LLC. All Rights Reserved.

Accelerating Web-Based SQL Server Applications with SafePeak Plug and Play Dynamic Database Caching

Mark Bennett. Search and the Virtual Machine

MakeMyTrip CUSTOMER SUCCESS STORY

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May Copyright 2014 Permabit Technology Corporation

ONSITE TRAINING CATALOG

Multi-Datacenter Replication

Capacity planning with Microsoft System Center

The Cloud Hosting Revolution: Learn How to Cut Costs and Eliminate Downtime with GlowHost's Cloud Hosting Services

Table of Contents. Abstract. Cloud computing basics. The app economy. The API platform for the app economy

ORACLE COHERENCE 12CR2

SQL Server Whitepaper. Top 5 SQL Server. Cluster Setup Mistakes. By Kendra Little Brent Ozar Unlimited

Hadoop Cluster Applications

MANAGEMENT SUMMARY INTRODUCTION KEY MESSAGES. Written by: Michael Azoff. Published June 2015, Ovum

1. Make sure you are clear about the terms being used

A Ranger4 Guide to. Application Performance Management. Ranger

Big data management with IBM General Parallel File System

Practical Cassandra. Vitalii

Guideline for stresstest Page 1 of 6. Stress test

Scaling Microsoft SQL Server

Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications

Distributed Storage Systems

SCALABILITY AND AVAILABILITY

A new Breed of Managed Hosting for the Cloud Computing Age. A Neovise Vendor White Paper, Prepared for SoftLayer

Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper

Network Management and Monitoring Software

BI4.x Architecture SAP CEG & GTM BI

Monitoring Best Practices for COMMERCE

VMware Virtualization and Cloud Management Overview VMware Inc. All rights reserved

Online Firm Improves Performance, Customer Service with Mission-Critical Storage Solution

Deploying Global Clusters for Site Disaster Recovery via Symantec Storage Foundation on Infortrend Systems

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Powering the Next Generation Cloud with Azure Stack, Nano Server & Windows Server 2016! Jeff Woolsey Principal Program Manager Cloud & Enterprise

Exchange 2010 ITPro. Milan Marenčík Microsoft Services

Architecting For Failure Why Cloud Architecture is Different! Michael Stiefel

Network Infrastructure Services CS848 Project

Business Case for Data Center Network Consolidation

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering

are you helping your customers achieve their expectations for IT based service quality and availability?

CASE STUDY: Oracle TimesTen In-Memory Database and Shared Disk HA Implementation at Instance level. -ORACLE TIMESTEN 11gR1

Achieving Zero Downtime for Apps in SQL Environments

From the Monolith to Microservices: Evolving Your Architecture to Scale. Randy linkedin.com/in/randyshoup

Virtual Desktop Infrastructure Optimization with SysTrack Monitoring Tools and Login VSI Testing Tools

Reducing the Cost and Complexity of Business Continuity and Disaster Recovery for

Bigdata High Availability (HA) Architecture

bigdata Managing Scale in Ontological Systems

February Considerations When Choosing a Secure Web Gateway

Microsoft Private Cloud

Proactive Performance Management for Enterprise Databases

BIG DATA THE NEW OPPORTUNITY

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Tableau Server 7.0 scalability

NetflixOSS A Cloud Native Architecture

Cloud Based Distributed Databases: The Future Ahead

Enabling Database-as-a-Service (DBaaS) within Enterprises or Cloud Offerings

HRG Assessment: Stratus everrun Enterprise

ORACLE DATABASE 10G ENTERPRISE EDITION

Transactions and ACID in MongoDB

Windows Server 2012 Hyper-V Installation and Configuration Guide

Microsoft Private Cloud Fast Track

Cloud Based Application Architectures using Smart Computing

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Reference Model for Cloud Applications CONSIDERATIONS FOR SW VENDORS BUILDING A SAAS SOLUTION

High-Availability in the Cloud Architectural Best Practices

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

Building Success on Acquia Cloud:

Oracle Database 11g: RAC Administration Release 2

DELL s Oracle Database Advisor

POWER ALL GLOBAL FILE SYSTEM (PGFS)

Quantcast Petabyte Storage at Half Price with QFS!

Techniques for implementing & running robust and reliable DB-centric Grid Applications

Monitoring Cloud Applications. Amit Pathak

Case Study: Real-time Analytics With Druid. Salil Kalia, Tech Lead, TO THE NEW Digital

Design and Evolution of the Apache Hadoop File System(HDFS)

Administering a Microsoft SQL Server 2000 Database

AMAZON S3: ARCHITECTING FOR RESILIENCY IN THE FACE OF FAILURES Jason McHugh

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce

Planning the Migration of Enterprise Applications to the Cloud

Building Private Cloud Architectures

Deploy App Orchestration 2.6 for High Availability and Disaster Recovery

WHITE PAPER. Data Center Fabrics. Why the Right Choice is so Important to Your Business

Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES

Cloud Backup and Recovery

Big Data Analytics - Accelerated. stream-horizon.com

Ingres Replicated High Availability Cluster

Delivering Quality in Software Performance and Scalability Testing

Transcription:

Architecting Distributed Databases for Failure A Case Study with Druid Fangjin Yang Cofounder @ Imply

The Bad The Really Bad Overview The Catastrophic Best Practices: Operations

Everything is going to fail!

Requirements Scalable - Tens of thousands of nodes - Petabytes of raw data Available - 24 x 7 x 365 uptime Performant - Run as smoothly as possible when things go wrong

Druid Open source distributed data store Column oriented storage of event data Low latency OLAP queries & low latency data ingestion Initially designed to power a SaaS for online advertising (in AWS) Our real-world example case study

The Bad

Single Server Failures Common Occurs for every imaginable and unimaginable reason - Hardware malfunction, kernel panic, network outage, etc. - Minimal impact Standard solution: replication

Druid Segments Timestamp Dimensions Measures Timestamp Dimensions Measures 2015-01-01T00 2015-01-01T01 2015-01-02T05 2015-01-02T07 2015-01-03T05 2015-01-03T07 2015-01-01T00 2015-01-01T01 Timestamp Dimensions Measures 2015-01-02T05 2015-01-02T07 Timestamp Dimensions Measures 2015-01-03T05 2015-01-03T07 Segment_2015-01-01/2014-01-02 Segment_2015-01-02/2014-01-03 Segment_2015-01-03/2014-01-04 Partition by time

Replication Example Druid Historicals Load Segment_1 Segment_2 Queries Druid Brokers Segment_2015-01-01/2015-01-02 (Segment_1) Segment_2015-01-01/2015-01-02 (Segment_2) Segment_1 Client Segment_2015-01-01/2015-01-02 () Segment_2

Query Segment_1 Druid Historicals Load Segment_1 Segment_2 Queries Druid Brokers Segment_2015-01-01/2015-01-02 (Segment_1) Segment_2015-01-01/2015-01-02 (Segment_2) Segment_1 Client Segment_2015-01-01/2015-01-02 () Segment_2

Query Segment_1 Druid Historicals Load Segment_1 Segment_2 Queries Druid Brokers Segment_2015-01-01/2015-01-02 (Segment_1) Segment_2015-01-01/2015-01-02 (Segment_2) Segment_1 Client Segment_2015-01-01/2015-01-02 () Segment_2

Multi-Server Failures Common: 1 server fails Less common: >1 server fails Data center issues (rack failure) Two strategies: - fast recovery - multi-datacenter replication

Fast Recovery Complete data availability in the face of multi-server failures is hard! Focus on fast recovery instead Be careful of the pitfalls of fast recovery More viable in the cloud

Fast Recovery Example Druid Historicals Deep Storage (S3/HDFS) Segment_1 Segment_2 Queries Druid Brokers Segment_2015-01-01/2015-01-02 (Segment_1) Load Segment_2015-01-01/2015-01-02 (Segment_2) Load Segment_1 Client Segment_2015-01-01/2015-01-02 () Segment_2

Fast Recovery Example Druid Historicals Deep Storage (S3/HDFS) Segment_1 Segment_2 Load Segment_1 Segment_2

Fast Recovery Example Druid Historicals Deep Storage (S3/HDFS) Segment_1 Segment_2 Load

Fast Recovery Example Druid Historicals Deep Storage (S3/HDFS) Segment_1 Segment_2 Druid Coordinator Load Load Segment_1, Load Segment_2,

Fast Recovery Example Druid Historicals Deep Storage (S3/HDFS) Segment_1 Segment_2 Druid Coordinator Load Segment_1 Segment_2

Dangers of Fast Recovery Easy to create bottlenecks - Prioritize how resources are spent during recovery - Druid prioritizes data availability and throttles replication Beware query hotspots - Intelligent load balancing during recovery is important

Fast Recovery Example Druid Historicals Deep Storage (S3/HDFS) Segment_1 Segment_2 Load Segment_1 Segment_2

Fast Recovery Example Deep Storage (S3/HDFS) Druid Historicals Load Segment_1 Segment_2 Overloaded!

The Really Bad

Data Center Outage Very uncommon Power loss Can be extremely disruptive without proper planning Solution: Multi-datacenter replication Beware pitfalls of multi-datacenter replication

Multi-Datacenter Replication Druid Historicals Queries Druid Brokers Data Center 1 Druid Coordinator Segment_1 Segment_2 Data Center 2 Segment_1 Client Segment_2

Multi-Datacenter Pitfalls Coordination + leader election can be tricky Communication can require non-trivial network time Coordination usually done with heartbeats and quorum decisions Writes, failovers, & consistent reads require round trips

Multi-Datacenter Replication Data Center 1 Client Data Center 2

The Catastrophic

Why are things slow today? Poor performance is much worse than things completely failing Causes: - Heavy concurrent usage (multi-tenancy) - Hotspots & variability - Bad software update

Architecting for Multi-tenancy Small units of computation - No single query should starve out a cluster

Druid Multi-tenancy Druid Historical Segment_query_1 Queries Segment_query_2 Segment_query_1 Segment_query_3 Processing Order Segment_query_2 Segment_query_1 Segment_query_4

Architecting for Multi-tenancy Resource prioritization and isolation - Not all queries are equal - Not all users are equal

Druid Multi-tenancy Druid Historicals Tier 1: Older Data Queries Druid Brokers Dedicated for Older data Tier 2: Newer Data Client Dedicated for Newer Dat Tier 2: Newer Data

Hotspots Incredible variability in query performance among nodes Nodes may become slow but not fail Difficult to detect as there is nothing obviously wrong Solutions: - Hedged requests - Selective Replication - Latency Induced Probation

Hedged Requests Druid Historicals Segment_1 Segment_2 Druid Brokers Segment_1 Client Segment_2

Hedged Requests Druid Historicals Segment_1 Segment_2 Druid Brokers Segment_1 Client Segment_2

Minimizing Variability Selective Replication Latency-induced probation Great paper: https://web.stanford.edu/class/cs240/readings/tail-at-scale.pdf

Bad Software Updates It is very difficult to simulate production traffic - Testing/staging clusters mostly verify correctness No noticeable failures for a long time Common cause of cascading failures

Rolling Upgrades Be able to update different components with no down time Backwards compatibility is extremely important Roll back if things are bad

Rolling Upgrades Druid Historicals V2 Queries Druid Brokers V1 V1 Client V1 V1

Rolling Upgrades Druid Historicals V2 Queries Druid Brokers V1 V2 Client V1 V2

Rolling Upgrades Druid Historicals V2 Queries Druid Brokers V2 V2 Client V1 V2

Best Practices: Operations

Monitoring Detection of when things go badly Define your critical metrics and acceptable values

Alerts Alert on critical errors - Out of disk space, out of cluster capacity, etc. Design alerts to reduce noise - Distinguish warnings and alerts

Exploratory Analytics Extremely critical to diagnosing root causes quickly Not many organizations do this

Takeaways Everything is going to fail! - Use replication for single server failures - Use fast recovery for multi-server failures (when you don t want to set up another data center) - Use multi-datacenter replication when availability really matters - Alerting, monitoring, and exploratory analysis are critical

Thanks! @implydata @druidio @fangjin imply.io druid.io