Redis Cluster: a pragmatic approach to distribution

All nodes are directly connected with a service channel. TCP port is baseport+4000, for example 6379 -> 10379. The node-to-node protocol is binary, optimized for bandwidth and speed. Clients talk to nodes as usual, using the ASCII protocol, with minor additions. Nodes don't proxy queries.

What do nodes talk about? PING: are you OK, dude? I'm master for XYZ hash slots. Config is FF89X1JK. Gossip: here is some info about other nodes I'm in touch with: A replied to my ping, I think its state is OK. B is idle, I guess it's having problems, but I need some ACK. PONG: sure, I'm OK, dude! I'm master for XYZ hash slots. Config is FF89X1JK. Gossip: I want to share with you some info about random nodes: C and D are fine and replied in time. But B is idle for me as well! IMHO it's down!

Hash slots The keyspace is divided into 4096 hash slots. But in this example we'll assume there are just ten, from 0 to 9 ;) Different nodes will hold a subset of the hash slots. A given key "foo" is at slot: slot = crc16("foo") mod NUMBER_SLOTS
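
A minimal sketch of that slot computation in Python, using the toy ten-slot keyspace from the example; binascii.crc_hqx with an initial value of 0 is assumed here to match the CRC16 variant Redis Cluster uses:

```python
import binascii

NUMBER_SLOTS = 10  # toy keyspace from the example (4096 in the real design)

def hash_slot(key: bytes) -> int:
    # CRC16 of the key, modulo the number of slots. crc_hqx(key, 0) is
    # assumed to match the CRC16 used by Redis Cluster.
    return binascii.crc_hqx(key, 0) % NUMBER_SLOTS

print(hash_slot(b"foo"))  # prints a slot number between 0 and 9
```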

Master and Slave nodes Nodes are all connected and functionally equivalent, but actually there are two kinds of nodes: slave nodes and master nodes.

Redundancy In the example there are two replicas for every master node, so up to two random nodes can go down without issues. Working with two nodes down is guaranteed, but in the best case the cluster will continue to work as long as there is at least one node for every hash slot.

What does this mean so far? Every key exists in a single instance only, plus N replicas that will never receive writes. So there is no merge, nor application-side inconsistency resolution. The price to pay is not resisting net splits that take down more nodes than there are replicas per hash slot. Master and slave nodes use the Redis replication you already know. Every physical server will usually hold multiple nodes, both slaves and masters, but the redis-trib cluster manager program will try to allocate slaves and masters so that the replicas are on different physical servers.

Client requests - dummy client
1. Client => A: GET foo
2. A => Client: -MOVED 8 192.168.5.21:6391
3. Client => B: GET foo
4. B => Client: "bar"
-MOVED 8 ... this error means that hash slot 8 is located at the specified IP/port, and the client should reissue the query there.
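
A minimal sketch of that dummy-client loop, using a throwaway raw-socket helper rather than a real client library; the node addresses are made up, and the reply handling is deliberately naive (it assumes a reply fits in one read):

```python
import random
import socket

# Illustrative seed nodes; addresses are invented for the example.
NODES = [("192.168.5.20", 6391), ("192.168.5.21", 6391)]

def raw_command(host, port, *args):
    """Send one command in the Redis protocol and return the raw reply text."""
    req = ("*%d\r\n" % len(args)).encode()
    for a in args:
        a = str(a).encode()
        req += b"$%d\r\n%s\r\n" % (len(a), a)
    with socket.create_connection((host, port)) as s:
        s.sendall(req)
        return s.recv(4096).decode()  # naive: assumes the reply fits one read

def dummy_get(key):
    host, port = random.choice(NODES)  # dummy client: just try a random node
    while True:
        reply = raw_command(host, port, "GET", key)
        if reply.startswith("-MOVED"):
            # e.g. "-MOVED 8 192.168.5.21:6391" -> reissue the query there
            addr = reply.split()[2]
            host, port = addr.split(":")[0], int(addr.split(":")[1])
            continue
        return reply  # e.g. "$3\r\nbar\r\n" for "bar"
```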

Client requests - smart client
1. Client => A: CLUSTER HINTS
2. A => Client: ... a map of hash slots -> nodes
3. Client => B: GET foo
4. B => Client: "bar"

Client requests Dummy, single-connection clients will work with minimal modifications to the existing client code base: just try a random node among a list, then reissue the query if needed. Smart clients will keep persistent connections to many nodes, will cache the hash slot -> node map, and will update the table when they receive a -MOVED error. This scheme is always horizontally scalable, and low latency if the clients are smart. Especially in large clusters, where clients will try to have many persistent connections to multiple nodes, the Redis client object should be shared.
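
A sketch of the smart-client routing table, reusing hash_slot and raw_command from the sketches above; the initial slot map is illustrative, and in practice it would be filled from the cluster (via the CLUSTER HINTS command proposed on the slide):

```python
# Reuses hash_slot() and raw_command() from the sketches above.

class SmartClient:
    def __init__(self, slot_map):
        # slot -> (host, port); illustrative, normally fetched from a node.
        self.slot_map = dict(slot_map)

    def get(self, key):
        host, port = self.slot_map[hash_slot(key.encode())]
        reply = raw_command(host, port, "GET", key)
        if reply.startswith("-MOVED"):
            # Stale table: learn the new location, then retry once.
            slot, addr = reply.split()[1], reply.split()[2]
            node = (addr.split(":")[0], int(addr.split(":")[1]))
            self.slot_map[int(slot)] = node
            reply = raw_command(node[0], node[1], "GET", key)
        return reply
```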

Re-sharding We are experiencing too much load: let's add a new server. Node C marks its slot 7 as "MOVING to D". Every time C receives a request about slot 7, if the key is actually in C, it replies; otherwise it replies with -ASK D. -ASK is like -MOVED, but the difference is that the client should retry against D only for this query, not for subsequent queries. That means: smart clients should not update their internal state.
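
Continuing the same sketch, -ASK handling differs from -MOVED only in that the cached map is left untouched; the exact redirection reply format assumed here is illustrative:

```python
# Continuing the SmartClient sketch: -ASK redirects only the current query.

def get_with_ask(client, key):
    host, port = client.slot_map[hash_slot(key.encode())]
    reply = raw_command(host, port, "GET", key)
    if reply.startswith("-ASK"):
        # Slot is being moved: retry this one query against the target node,
        # but do NOT update the cached slot map (unlike -MOVED).
        addr = reply.split()[-1]  # assumes the redirection carries host:port
        host, port = addr.split(":")[0], int(addr.split(":")[1])
        # Note: released Redis Cluster also expects an ASKING command before
        # the redirected query; the slide does not cover that detail.
        reply = raw_command(host, port, "GET", key)
    return reply
```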

Re-sharding - moving data All the new keys for slot 7 will be created/updated in D. All the old keys in C will be moved to D by redis-trib using the MIGRATE command. MIGRATE is an atomic command: it will transfer a key from C to D and will remove the key from C once we get the OK from D, so no race is possible. P.S. MIGRATE is an exported command, have fun... Open problem: asking C for the next key in hash slot N, efficiently.
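
For illustration, the same kind of MIGRATE call that redis-trib would issue can be sent by hand with the raw_command helper above; the addresses and the 5000 ms timeout are made up for the example:

```python
# Atomically move key "foo" from node C to node D (illustrative addresses).
# The key is removed from C only after D acknowledges it.
reply = raw_command("192.168.5.20", 6391,            # node C
                    "MIGRATE", "192.168.5.22", 6391,  # node D host / port
                    "foo", 0, 5000)                   # key, target db, timeout ms
print(reply)  # "+OK\r\n" on success
```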

Re-sharding with failing nodes Nodes can fail while resharding. It's slave promotion as usual. The redis-trib utility is executed by the sysadmin; it checks the cluster config continuously while resharding, and will exit and warn when something is not OK.

Fault tolerance All nodes continuously ping other nodes... A node marks another node as possibly failing when there is a timeout longer than N seconds. Every PING and PONG packet contains a gossip section: information about other nodes' idle times, from the point of view of the sending node.

Fault tolerance - failing nodes A guesses B is failing, as the latest PING request timed out. A will not take any action without any other hint. C sends a PONG to A, with the gossip section containing information about B: C also thinks B is failing. At this point A marks B as failed, and notifies the information to all the other nodes in the cluster, which will mark the node as failing. If B ever comes back, the first time it pings any node of the cluster it will be told to shut down ASAP, as an intermittently available node is not good for the clients. The only way to rejoin a Redis cluster after a massive crash is running redis-trib by hand.
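
A rough, illustrative sketch of that rule: a local timeout marks a peer as possibly failing, and an independent opinion arriving via gossip upgrades it to failed. The constants and data shapes are invented for the example:

```python
import time

NODE_TIMEOUT = 30  # illustrative: seconds of silence before a peer looks suspect

class PeerView:
    """One node's local opinion of its peers."""
    def __init__(self):
        self.last_pong = {}           # node id -> time of last reply
        self.possibly_failing = set()
        self.failed = set()

    def tick(self):
        # Local evidence only: a long-idle peer becomes "possibly failing".
        now = time.time()
        for node_id, seen in self.last_pong.items():
            if now - seen > NODE_TIMEOUT:
                self.possibly_failing.add(node_id)

    def on_pong(self, node_id, gossip):
        # gossip: {other node id: True/False}, does the sender also think it fails?
        self.last_pong[node_id] = time.time()
        for other, sender_thinks_failing in gossip.items():
            if sender_thinks_failing and other in self.possibly_failing:
                # Our suspicion plus at least one other node's: mark as failed
                # (and, in the real cluster, broadcast that to the other nodes).
                self.failed.add(other)
```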

Redis-trib - the Redis Cluster Manager It is used to set up a new cluster, once you start N blank nodes. It is used to check if the cluster is consistent, and to fix it if the cluster can't continue because there are hash slots without a single node. It is used to add new nodes to the cluster, either as slaves of an already existing master node, or as blank nodes to which we can re-shard a few hash slots to lower other nodes' load.

It's more complex than this... there are many details that can't fit in a 20-minute presentation... Ping/Pong packets contain enough information for the cluster to restart after a graceful stop, but the sysadmin can use the CLUSTER MEET command to make sure nodes will engage if IPs changed, and so forth. Every node has a unique ID, and a cluster config file. Every time the config changes, the cluster config file is saved. The cluster config file can't be edited by humans. The node ID never changes for a given node. Questions?