Running an E-Commerce Database in the Cloud. Mark Uhrmacher (CTO) Aaron Brown (Senior Systems Engineer) ideeli




What is ideeli?
- Fastest growing, members-only online shopping destination
- Leader in mass affluent women's market; fast growing men's category
- 4+ million members
- First event: December 2007
- 6000+ successful events to date
- $250 million year-end run rate
- Grew 41,000% since launch in 2007
- Strong commitment, success in footwear
- 189% growth in the past 12 months

Where are we now?
- 400GB production dataset
- 3TB reporting dataset
- More than 24k qps & 80k dynamic rpm at peak
- A self-inflicted DoS every day at noon

System Architecture

System Architecture: Simple version
- Ruby on Rails
- Web Stack: nginx, haproxy, apache, Phusion Passenger
- Database/Caching: memcached, Percona Server 5.1

AWS Terminology
- region == data center
- availability zone == isolated pod
- EBS == persistent storage (NAS)

Replication Strategy: Server Locality & Disaster Recovery
- Master/Master pairs in separate AZs
- Replica trees stay within same AZ
- Intraregion replicas in separate AZs
- Extraregion disaster recovery instance for very bad days
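The locality rules on this slide can be checked mechanically once the topology is written down as data. The following is a minimal sketch, not the authors' tooling: the `DbNode` fields, the role names, and `check_topology` are all hypothetical, and only two of the rules (masters in separate AZs, DR instance outside the primary region) are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbNode:
    name: str    # hypothetical host label
    region: str  # e.g. "us-east-1"
    az: str      # e.g. "us-east-1a"
    role: str    # assumed role names: "master", "replica", or "dr"


def check_topology(nodes):
    """Return a list of human-readable violations of the locality rules."""
    problems = []

    # Rule: master/master pairs live in separate availability zones.
    masters = [n for n in nodes if n.role == "master"]
    if len({m.az for m in masters}) < len(masters):
        problems.append("masters share an availability zone")

    # Rule: the disaster recovery instance lives outside the primary region.
    primary_regions = {m.region for m in masters}
    for n in nodes:
        if n.role == "dr" and n.region in primary_regions:
            problems.append(f"{n.name}: DR instance is inside the primary region")

    return problems
```

Running a check like this against every planned change is one way to keep the diagram and the deployed topology from drifting apart.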

Database Failover Semantics (Kinesthetic Learning)
1. In-app failover
   - db1 unavailable? Try db2!
2. MySQL Proxy (disaster)
3. master/master with manual failover (finger on the keyboard, pager on the SysEng)
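The first option, in-app failover, amounts to trying an ordered list of hosts until one accepts a connection. The sketch below illustrates that idea only; it is not ideeli's code. The function name `connect_with_failover` and the injected `connect` callable are assumptions, so the example stays independent of any particular database driver.

```python
def connect_with_failover(hosts, connect):
    """Try each host in order; return (host, connection) for the first success.

    `connect` is a hypothetical callable that takes a host name and either
    returns a connection object or raises. In a real application it would
    wrap the database driver's own connect call.
    """
    errors = []
    for host in hosts:
        try:
            return host, connect(host)
        except Exception as exc:  # broad on purpose: any failure means "try the next host"
            errors.append((host, exc))
    raise ConnectionError(f"all database hosts failed: {errors}")
```

With a master/master pair this would be called as `connect_with_failover(["db1", "db2"], connect)`. Note that automatic failover between two writable masters can cause conflicting writes, which may be part of why the slide keeps a human in the loop for the master/master switch.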

EBS RAID

The Problem:
- Single EBS volumes: 100-150 iops
- Disk/Network traffic shared on a multitenant NIC
- Highly variable disk latency
- EBS volumes fail in unexpected ways
- No performance guarantees

RAID 10 for performance & reliability:
- Why not RAID0? It must be RAID(1 5 6 10) on the back end, right?
- Diminishing returns after ~10 EBS volumes
- Linux software raid (md)
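A back-of-the-envelope model shows why striping mirrored pairs helps when each volume delivers only 100-150 iops. This is an idealized sketch of classic RAID 10 (striped mirrors), not a measurement: it ignores the shared NIC, the variable latency, and the diminishing returns past roughly 10 volumes that the slide mentions. The function name is hypothetical.

```python
def raid10_estimate(volumes, iops_per_volume):
    """Idealized IOPS ceiling for RAID 10 built from `volumes` equal volumes."""
    if volumes < 4 or volumes % 2:
        raise ValueError("this model assumes an even number of volumes, at least 4")
    mirrors = volumes // 2
    return {
        # Reads can be served by either side of every mirror.
        "read_iops": volumes * iops_per_volume,
        # Every write must land on both sides of a mirror.
        "write_iops": mirrors * iops_per_volume,
    }
```

Under these assumptions, `raid10_estimate(10, 100)` gives a ceiling of 1000 read iops and 500 write iops, against 100 for a single volume at the low end of the quoted range.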

Application Tuning: Query Reduction
- Query reduction through code optimization
- ORMs aren't always so smart
- Evolution of caching strategies
  1. memcached
  2. membase
  3. back to memcached
  4. testing Riak
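Whichever cache backend is in use, the query-reduction pattern is the same: look in the cache first and run the query only on a miss. The sketch below shows a read-through cache with a plain dict standing in for memcached. The class name, the `load` callable, and the `misses` counter are illustrative assumptions, not part of the original deck.

```python
class ReadThroughCache:
    """Serve repeated lookups from a cache; hit the database only on a miss."""

    def __init__(self, load, store=None):
        self._load = load  # hypothetical callable: key -> value (the expensive query)
        self._store = {} if store is None else store  # dict stands in for memcached
        self.misses = 0

    def get(self, key):
        if key in self._store:
            return self._store[key]
        self.misses += 1
        value = self._load(key)
        self._store[key] = value
        return value

    def invalidate(self, key):
        """Drop a key after the underlying row changes."""
        self._store.pop(key, None)
```

A real deployment would also need expiry and a plan for many clients missing the same key at once, which this sketch leaves out.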

Server Tuning: The Problems
- Limited concurrency
- Frequent server mini-stalls
- Slow disk
- Widely variable disk latency
- Multitenancy
- EBS performance

Server Tuning
- "Big" servers
  - production servers are m2.4xlarge: 68 GB RAM / 8x2.66GHz CPU
  - bigger servers == less multi-tenancy == (more) consistent performance
- innodb_log_file_size = 4GB
  - buffers and optimizes writes
- query_cache_type = 0
  - the query cache caused lockups due to mutex contention

Percona Server/Services
- Multi-second DB lockups at peak
- Queuing at the load balancer
- Enlisted Percona Services
* See Percona/ideeli case study

Percona Server
- Switched to Percona Server w/ XtraDB
- Internally caused lockups ceased
- Response curve flattened

Backup Strategies
- Cold backups
  - Copy backup to S3
  - Slow recovery time
- EBS snapshots with XFS filesystems
  - ec2-consistent-snapshot from Alestic
  - works with EBS RAID
  - instant atomic snapshot
  - delayed while snapshot writes
  - nearly instant recovery
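Snapshotting a RAID set only yields a usable backup if every member volume is captured at the same logical instant, which is why the database and filesystem are quiesced first. The sketch below shows that general ordering (lock, freeze, snapshot every volume, then release in reverse). It is an illustration of the idea, not the implementation of ec2-consistent-snapshot, and all five callables are hypothetical stand-ins for the real operations.

```python
def consistent_snapshot(volumes, lock_db, unlock_db, freeze_fs, unfreeze_fs, start_snapshot):
    """Start a snapshot of every volume while writes are quiesced.

    The callables are assumptions for illustration, e.g. a global read lock
    on the database, a filesystem freeze, and a per-volume snapshot request.
    """
    snapshot_ids = []
    lock_db()
    try:
        freeze_fs()
        try:
            # All member volumes are captured inside the same frozen window.
            for volume in volumes:
                snapshot_ids.append(start_snapshot(volume))
        finally:
            unfreeze_fs()  # always thaw, even if a snapshot request fails
    finally:
        unlock_db()  # always release the lock last
    return snapshot_ids
```

The `try`/`finally` nesting matters more than the happy path: a failed snapshot request must never leave the filesystem frozen or the database locked.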

#ec2pocalypse

#ec2pocalypse
- "stuck" volumes (no iops, 100% utilization)

    Device:   r/s    w/s    %util
    sdi2      0.00   0.00   100.10

- Stuck volumes were in one AZ, but multiple AZs experienced API failure
- Reconstruction, an additional benefit of RAIDed EBS volumes
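The signature of a stuck volume in the iostat excerpt on this slide is zero completed reads and writes alongside roughly 100% utilization. A monitoring check for that pattern could look like the sketch below. The function names and the 99% threshold are assumptions, and the rows are expected to be parsed from iostat output beforehand.

```python
def is_stuck(reads_per_s, writes_per_s, util_pct, util_threshold=99.0):
    """True when a device reports ~100% utilization while completing no I/O."""
    return reads_per_s == 0.0 and writes_per_s == 0.0 and util_pct >= util_threshold


def stuck_devices(rows):
    """`rows` is an iterable of (device, r/s, w/s, %util) tuples, already parsed."""
    return [dev for dev, r, w, util in rows if is_stuck(r, w, util)]
```

Fed the row from the slide, `stuck_devices([("sdi2", 0.00, 0.00, 100.10)])` flags `sdi2`. An idle but healthy device shows zero I/O with near-zero utilization and is not flagged.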

#ec2pocalypse: Reality check
- Cross-zone, multi-tiered DR strategy poorly designed
- Loss of us-east-1a caused both DR replicas to become out of date
- DR replica should slave directly off of master
- Binlog retention period was too short
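The binlog point comes down to simple arithmetic: a replica that has fallen behind can only resume if the master still holds every binlog written since the replica stopped. The sketch below states that condition. The function names and the safety factor of 2 are illustrative assumptions, not figures from the talk.

```python
from datetime import timedelta


def can_catch_up(replica_behind, binlog_retention):
    """A replica can resume only if the binlogs it still needs were not purged."""
    return replica_behind < binlog_retention


def minimum_retention(longest_outage, safety_factor=2.0):
    """Retention to configure so a replica survives `longest_outage` (assumed margin)."""
    return longest_outage * safety_factor
```

For example, with three days of retention a replica that is 30 hours behind can still catch up, while one that is four days behind has to be rebuilt from a backup.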

#ec2pocalypse: What we learned
- This was a data center outage
- Something similar could have happened anywhere
- Possibly with less downtime
- Data center failure, cloud success
- Expect failure
- Cache heavily
- No really, expect failure

We're hiring
- ideeli.com/pages/careers
- jobs@ideeli.com
- If 24K QPS excites you, come talk to us

Appendix

AWS Terminology: Storage

Ephemeral:
- Local storage
- Free and plentiful
- Irrecoverable upon instance termination or crash
- Not well suited for important data

EBS (Elastic Block Store):
- Persistent storage
- Essentially NAS
- Bandwidth shared with network
- Slow, high latency
- Rich API
- Maximum individual volume size of 1TB

AWS Terminology: Topology

Regions:
- Separate, geographically isolated data centers
- Current regions are Virginia, California, Ireland, Tokyo, and Singapore
- Inter-region network traffic uses public Internet

Availability Zones:
- Logically separate zones within a region
- Failures in one zone should not affect another zone
- Inter-zone, intra-region network traffic on private, high-speed network

#ec2pocalypse: What happened?
- Someone pressed the wrong button
- "re-mirroring storm" of EBS volumes
- Locked up 13% of volumes in a single availability zone
- EBS APIs were degraded, making recovery more difficult

Lessons Learned
- Expect failure
- If you need a database server, you need 2 (maybe 3) in different AZs and regions
- Use software RAID to protect against individual EBS volume failure
- Visually diagram regional and AZ topology
- Cache heavily
- Disk is bad, worse on EBS. Try to avoid it.
- No, really. Expect failure. You will be required to organically test your DR plan.