Scaling To 1 Billion Hits A Day. Chander Dhall Twitter @csdhall Me@ChanderDhall.com


About Me: Microsoft MVP; Tech Ed Speaker; ASP.NET Insider; Web API Advisor; Pluralsight Author; Dev Chair, Dev Connections

About Me: Conference Organizer, jssaturday; Leader, NodeLA user group; Leader, .NET user group at UT Dallas; Owner, Chander Dhall, Inc.; Conference Organizer, MVPMIX.com; Chander Tech Podcast

Free Resharper http://chanderdhall.com/codecamp

Why? Amazon claims that just an extra 1/10th of a second on their response times will cost them 1% in sales. Google found that a half-second increase in latency caused traffic to drop by a fifth.

Theory of Scaling

Practice of Scalability

Agenda: Why is it important to scale? Creating a scalable solution (in incremental steps): propose an architecture, identify failures and bottlenecks, identify downtime, apply a better solution, and repeat until we solve (in 10 steps). Then some bonus stuff (a better solution).

Unfortunate Solution (diagram: Load Balancer, five servers (S), Services)

Gilbert and Lynch white paper (diagram: Network 1, Network 2). A, write algorithm: { "name": "Chander", "gender": "m" }. B, read algorithm: { "name": "Chander", "gender": "m" }.

Happy path scenario (diagram: Network 1, Network 2, update message). A, write algorithm: { "name": "Dhall", "gender": "m" }. B, read algorithm: { "name": "Chander", "gender": "m" }.

Happy path scenario (diagram: Network 1, Network 2). A, write algorithm: { "name": "Dhall", "gender": "m" }. B, read algorithm: { "name": "Dhall", "gender": "m" }.

Network partitions (diagram: Network 1, Network 2, update message). A, write algorithm: { "name": "Dhall", "gender": "m" }. B, read algorithm: { "name": "Chander", "gender": "m" }.

CAP Theorem: Consistency, Availability, Partition Tolerance

Brewer's CAP Theorem: Consistency (or, more appropriately, Atomic). Availability. Partition Tolerance: "No set of failures less than total network failure is allowed to cause the system to respond incorrectly" (Gilbert & Lynch).
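
The happy-path and network-partition slides can be replayed in a few lines. The sketch below is only an illustration: `Replica`, `write`, and the `prefer_consistency` switch are hypothetical names, and a real system would use timeouts and retries rather than a flag. It shows that during a partition the system either accepts the write and lets B serve stale data, or rejects the write to keep A and B in agreement.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data, standing in for node A or node B."""
    data: dict = field(default_factory=dict)


def write(a, b, key, value, partitioned, prefer_consistency):
    """Write to replica `a`, then forward the update message to replica `b`.

    Returns True if the write was accepted. During a partition the update
    message cannot reach `b`, so one of two behaviours has to be chosen.
    """
    if partitioned and prefer_consistency:
        return False              # stay consistent, give up availability
    a.data[key] = value
    if not partitioned:
        b.data[key] = value       # update message delivered
    return True                   # if partitioned: available, but `b` is stale


initial = {"name": "Chander", "gender": "m"}

# Happy path: the update message arrives and both replicas agree.
a, b = Replica(dict(initial)), Replica(dict(initial))
write(a, b, "name", "Dhall", partitioned=False, prefer_consistency=True)
happy_read = b.data["name"]

# Partition, availability preferred: the write succeeds, B serves stale data.
a, b = Replica(dict(initial)), Replica(dict(initial))
ap_accepted = write(a, b, "name", "Dhall", partitioned=True, prefer_consistency=False)
ap_read = b.data["name"]

# Partition, consistency preferred: the write is rejected, A and B never diverge.
a, b = Replica(dict(initial)), Replica(dict(initial))
cp_accepted = write(a, b, "name", "Dhall", partitioned=True, prefer_consistency=True)
cp_read = (a.data["name"], b.data["name"])

print(happy_read, ap_accepted, ap_read, cp_accepted, cp_read)
```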

Just FYI: Consistency (in CAP theorem); Atomicity (in ACID); Consistency (in ACID): means any transaction will bring the database from one valid state to another.

Fallacies of Distributed Computing: The network is reliable. Latency is zero. Bandwidth is infinite. The network is secure. Topology doesn't change. There is one administrator.

Fallacies of Distributed Computing (continued): Transport cost is zero. The network is homogeneous.

Why is Scalability Important? Instant success, thanks to social networking. Twitter: 200 billion tweets per year. Facebook: 1.23 billion monthly active users. Billions of devices (desktops, tablets, mobile). Need: millions of hits with zero downtime.

Why is Scalability Important? "The website was working great UNTIL we launched." Instagram was down on the launch day.

The Variables. Scalability: number of users / sessions / transactions / operations the entire system can handle. Performance: optimal utilization of resources. Responsiveness: time taken per operation. Availability: probability of the application being available at any given point in time. Downtime Impact: the impact of a downtime of a server/service/resource (number of users, type of impact, etc.).
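
To put numbers on these variables, the arithmetic below converts the talk's headline figure of 1 billion hits a day into hits per second, and an availability percentage into hours of downtime per year. The peak factor of 3 and the 99.9% availability figure are assumptions chosen for illustration, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
HOURS_PER_YEAR = 365 * 24           # 8,760 (non-leap year)

hits_per_day = 1_000_000_000
average_hits_per_second = hits_per_day / SECONDS_PER_DAY   # about 11,574

PEAK_FACTOR = 3                     # assumption: peak traffic is 3x the average
peak_hits_per_second = average_hits_per_second * PEAK_FACTOR


def downtime_hours_per_year(availability):
    """Expected unavailable hours per year for an availability such as 0.999."""
    return (1 - availability) * HOURS_PER_YEAR


print(f"average: {average_hits_per_second:,.0f} hits/s")
print(f"assumed peak: {peak_hits_per_second:,.0f} hits/s")
print(f"99.9% availability: {downtime_hours_per_year(0.999):.2f} hours down per year")
```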

Major Factors: platform selection; hardware; application design; database/datastore structure and architecture; caching strategy; asynchronous processing; deployment process and architecture; monitoring mechanisms; and more.

Step 1: App Server & DB Server (diagram: App Server, Database Server)

Step 2: Vertical Scaling of App Server & DB Server (diagram: App Server, Database Server). Throw more RAM and CPU.

Step 2: Vertical Scaling (or scale up). Increasing the hardware resources without changing the number of nodes. Disadvantages: law of diminishing returns; downtime; increases downtime impact; incremental costs increase exponentially.

Step 3: Vertical Partitioning (Services). Introduction: deploying each service on a separate node. Advantages: increases availability (per app); easy to tune and optimize; reduces context switching; simple to implement. (diagram: App Server, DB Server)

Step 3: Vertical Partitioning (Services). Disadvantages: sub-optimal resource utilization; may not increase overall availability; finite scalability. (diagram: App Server, DB Server)

Vertical Partitioning: Distribute the responsibilities. Increased number of nodes. Each node (or cluster) performs separate tasks. Each node (or cluster) is different from the other.

Step 4: Horizontal Scaling (diagram: Load Balancer, DB Server)

Horizontal Scaling (scale out): Replication of nodes. Nodes perform the same tasks. Nodes are identical.
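
Because the nodes are identical, the load balancer can hand any request to any of them. A minimal round-robin sketch follows; the server names and the `route` function are hypothetical, and production balancers also track health and current load.

```python
from collections import Counter
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]   # hypothetical identical app servers
next_server = cycle(servers)            # endless round-robin iterator


def route(request_id):
    """Pick the next server in turn; the request content does not matter."""
    return next(next_server)


assignments = [route(i) for i in range(9)]
load = Counter(assignments)
print(load)   # every server receives the same number of requests
```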

Sticky Sessions: Subsequent requests from a user are sent to the original server. Asymmetrical load distribution. Downtime impact: loss of session data. (diagram: User 1, Load Balancer)

Central Session Store: Session store is a single point of failure. Session reads and writes generate disk + network I/O. (diagram: Load Balancer, Session Store, App Server)

Clustered Session Management: No single point of failure. Session reads are instantaneous. Session writes generate network I/O. Increase in number of nodes increases network I/O exponentially. What happens when a user request arrives before inter-node communication has finished, or when inter-node communication fails? (diagram: Load Balancer)

Recommendations: Use a scaled version of a Central Session Store (recommended). Use Clustered Session Management ONLY if you have a smaller number of app servers and fewer session writes. Don't use sticky sessions if you want to scale.
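
A rough sketch of the recommended arrangement: app servers keep no session state of their own and read and write sessions through a shared store. `SessionStore` here is an in-memory stand-in with hypothetical method names; in practice it would be a replicated or clustered store so that it does not become the single point of failure itself.

```python
class SessionStore:
    """In-memory stand-in for a shared (and, in practice, scaled) session store."""

    def __init__(self):
        self._sessions = {}

    def load(self, session_id):
        return self._sessions.get(session_id, {})

    def save(self, session_id, session):
        self._sessions[session_id] = dict(session)


class AppServer:
    """Stateless app server: all session state lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.load(session_id)
        cart = session.get("cart", []) + [item]
        self.store.save(session_id, {"cart": cart})
        return cart


store = SessionStore()
servers = [AppServer("app-1", store), AppServer("app-2", store)]

servers[0].add_to_cart("user-1", "book")       # first request lands on app-1
del servers[0]                                 # app-1 goes down
cart = servers[0].add_to_cart("user-1", "pen") # next request lands on app-2
print(cart)                                    # the session survived the failure
```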

Load Balanced App Server Cluster: Active-Active assumes that each LB is independently able to take up the load of the other. (diagram: Users, Load Balancer, Load Balancer)

Step 5: Vertical Partitioning (Hardware) (diagram: Load Balancer, Load Balancer, DB Server, SAN)

Step 5: Vertical Partitioning (Hardware). Advantages: allows scaling up the DB server; boosts performance of the DB server. Disadvantages: increases cost.

Step 6: Horizontal Scaling (DB). Introduction: increasing the number of DB nodes, referred to as scaling out the DB server. Options: Shared Nothing Cluster; Real Application Cluster (or Shared Storage Cluster).

Step 6: Horizontal Scaling (DB) (diagram: Load Balancer, Load Balancer, DB Server, SAN)

Step 6: Horizontal Scaling (DB) (diagram: Load Balancer, Load Balancer, DB Replica, three DB Servers, SAN)

Step 7: Vertical / Horizontal Partitioning (DB). Introduction: increasing the number of DB clusters by dividing the data. Options: vertical partitioning (dividing tables / columns); horizontal partitioning (dividing by rows (value)).

Step 7: Vertical / Horizontal Partitioning (DB) (diagram: Load Balancer, Load Balancer, DB Cluster of three DB Servers, SAN)

Step 7: Vertical / Horizontal Partitioning (DB). Vertical partitioning (diagram: App Cluster; Db Cluster 1: Twitter Table, Facebook Table; Db Cluster 2: Users Table, Products Table)

Step 7: Vertical / Horizontal Partitioning (DB). Horizontal partitioning (diagram: App Cluster; Db Cluster 1, 1st million users: Twitter Table, Facebook Table; Db Cluster 2, 2nd million users: Twitter Table, Facebook Table)

Step 7: Vertical / Horizontal Partitioning (DB) (diagram: Load Balancer, Load Balancer, Hash Map, SAN, two DB Clusters of three DBs each)
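
The "Hash Map" box in the diagram decides which DB cluster holds a given row. The slides do not say how it is built, so the sketch below swaps in one common possibility, hashing the user ID modulo the number of clusters; `cluster_for` is a hypothetical helper. It also measures why re-sharding is disruptive with this scheme: adding a third cluster moves a large share of users (about two thirds on average for a uniform hash).

```python
import hashlib


def cluster_for(user_id, cluster_count):
    """Map a user ID to a cluster index with a stable hash (an assumed scheme)."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cluster_count


user_ids = range(1, 1001)
before = {u: cluster_for(u, 2) for u in user_ids}   # two DB clusters
after = {u: cluster_for(u, 3) for u in user_ids}    # a third cluster is added

moved = sum(1 for u in user_ids if before[u] != after[u])
print(f"{moved} of {len(user_ids)} users change cluster when going from 2 to 3")
```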

Step 8: Separating Sets (diagram: Global Redirector with a Global Lookup Hash Map; each set has its own Load Balancers, Hash Map, and DB Clusters; Set: 1-10 Million Users; Set: 11-20 Million Users)
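
The global redirector only has to answer one question: which set owns this user? A minimal range lookup is sketched below, assuming the slide's labels mean users 1 to 10,000,000 and 10,000,001 to 20,000,000; the set names and `set_for` are hypothetical.

```python
from bisect import bisect_left

# Assumed reading of the slide: each set owns a contiguous range of user numbers.
SET_UPPER_BOUNDS = [10_000_000, 20_000_000]
SET_NAMES = ["set-1", "set-2"]


def set_for(user_number):
    """Return the set that owns `user_number` (1-based)."""
    if user_number < 1:
        raise ValueError("user numbers start at 1")
    index = bisect_left(SET_UPPER_BOUNDS, user_number)
    if index == len(SET_UPPER_BOUNDS):
        raise LookupError(f"no set configured for user {user_number}")
    return SET_NAMES[index]


print(set_for(1), set_for(10_000_000), set_for(10_000_001))
```

Adding capacity under this scheme means appending a new upper bound and set name; existing users stay where they are.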

Step 9: Caching. Add caches within the App Server: object cache, session cache, API cache, page cache. Software: Memcached, Redis, Azure Cache (AppFabric).
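
An object cache is usually wired in with a read-through-then-store pattern (often called cache-aside). The sketch below uses plain dictionaries as stand-ins for the database and for Memcached or Redis; `get_user`, `update_user`, and the `stats` counter are hypothetical names.

```python
DATABASE = {"u1": {"name": "Chander"}}   # stand-in for the DB cluster
cache = {}                               # stand-in for Memcached / Redis
stats = {"db_reads": 0}


def load_from_db(user_id):
    stats["db_reads"] += 1
    return DATABASE[user_id]


def get_user(user_id):
    """Serve from the cache when possible, otherwise read the DB and remember it."""
    if user_id in cache:
        return cache[user_id]
    user = load_from_db(user_id)
    cache[user_id] = user
    return user


def update_user(user_id, fields):
    """Write to the DB, then invalidate the cached copy so it cannot go stale."""
    DATABASE[user_id] = {**DATABASE[user_id], **fields}
    cache.pop(user_id, None)


get_user("u1")
get_user("u1")                           # second read is a cache hit
reads_after_two_gets = stats["db_reads"]
update_user("u1", {"name": "Dhall"})
fresh = get_user("u1")                   # cache was invalidated, so the DB is read
print(reads_after_two_gets, stats["db_reads"], fresh)
```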

Step 10: HTTP Accelerator. A good HTTP accelerator / reverse proxy performs the following: redirects static content requests to a lighter HTTP server (lighttpd); caches content based on rules; uses async non-blocking I/O; maintains a limited pool of keep-alive connections to the App Server; performs intelligent load balancing. Solutions: Nginx (HTTP / IMAP); Perlbal; hardware accelerators plus LBs.

More Important Stuff: CDNs. IP anycasting. Async non-blocking I/O (for all network servers). If possible, async non-blocking I/O for disk. Incorporate a multi-layer caching strategy where required: L1 cache in-process with the App Server; L2 cache across the network boundary; L3 cache on disk. Grid computing.
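
The multi-layer strategy can be pictured as a chain of lookups from fastest to slowest. In the sketch below every layer is an in-memory dictionary purely for illustration (a real L2 would sit across the network and a real L3 on disk); `Layer` and `lookup` are hypothetical names. A hit in a slower layer is copied into the faster layers that missed.

```python
class Layer:
    """One cache layer; `items` stands in for process memory, network, or disk."""

    def __init__(self, name):
        self.name = name
        self.items = {}


def lookup(key, layers, load_from_origin):
    """Check layers from fastest to slowest; fill the layers that missed."""
    missed = []
    for layer in layers:
        if key in layer.items:
            value, source = layer.items[key], layer.name
            break
        missed.append(layer)
    else:
        value, source = load_from_origin(key), "origin"
    for layer in missed:
        layer.items[key] = value
    return value, source


l1, l2, l3 = Layer("L1"), Layer("L2"), Layer("L3")
l3.items["page:/home"] = "<html>home</html>"

first = lookup("page:/home", [l1, l2, l3], load_from_origin=str.upper)
second = lookup("page:/home", [l1, l2, l3], load_from_origin=str.upper)
miss = lookup("k", [l1, l2, l3], load_from_origin=str.upper)
print(first[1], second[1], miss)
```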

NoSql vs Relational (chart: Scalability and Performance plotted against Depth of Functionality; data points: Memcached, Key Value, Document Databases, Relational Databases)

NoSql vs Relational: No joins. Do you need them though? Transactions: RDBMS is great for concurrency, integrity, or data type validity.

Relational -> NoSql: Ever-increasing users. Scalability needs. Highly structured data giving way to structured, semi-structured, and unstructured data. Advent of high-speed data networking. Distributed computing. Cheap and plentiful memory.

Relational -> NoSql http://www.couchbase.com/why-nosql/nosql-database

Scaling RDBMS: RDBMS sharding. Highly disruptive to re-shard. Lose benefits of the relational model. Create and maintain schema on every server.

Scaling RDBMS: Denormalizing. Why use an RDBMS?

Scaling RDBMS: Distributed caching for RDBMS (e.g. memcached). Speeds up reads only. Cold cache thrash. Management costs.

Relational -> NoSql: Schemaless. Auto-sharding. Distributed querying. Integrated caching.

NoSQL Types: key-value; ordered key-value; wide column store; document store / full-text search; graph DBs; object DBs.

Key-Value Store. Pros: Simple. Programmer friendly. Powerful. Fast. Cons: Key range support not good. Aggregation support lacking. (diagram: keys mapped to values)

Ordered Key-Value Store. Pros: Processes key ranges. More powerful. Cons: No framework for value modeling. (diagram: keys mapped to values)
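
What an ordered key-value store adds over a plain one is the ability to scan a key range without touching every key. A toy version is sketched below; `OrderedKV` is a hypothetical class that keeps a sorted key list next to a dictionary, whereas real stores typically use on-disk sorted structures.

```python
from bisect import bisect_left, insort


class OrderedKV:
    """Toy ordered key-value store: dict for values, sorted list for key order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            insort(self._keys, key)      # keep keys sorted on insert
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = OrderedKV()
for key in ["user:0003", "user:0001", "order:0007", "user:0002"]:
    store.put(key, key.upper())

users = store.range("user:", "user:~")   # "~" sorts after digits in ASCII
print(users)
```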

Big Table. Pros: Model values as maps of maps of maps. Cons: Not appropriate for schemas of arbitrary complexity. (diagram: keys mapped to values)

Big Table (diagram: Key, Column family; Key, Column family)
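
"Maps of maps of maps" can be read as row key, then column family, then column, then value. The sketch below models that with nested dictionaries; the row, family, and column names are made up for illustration, and `put` and `get_family` are hypothetical helpers, not the API of any particular wide column store.

```python
table = {}   # row key -> column family -> column -> value


def put(table, row_key, family, column, value):
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get_family(table, row_key, family):
    """Read one column family of one row without touching the others."""
    return table.get(row_key, {}).get(family, {})


put(table, "chander", "address", "city", "los angeles")
put(table, "chander", "address", "state", "CA")
put(table, "chander", "skills", "hadoop", "expert")

print(get_family(table, "chander", "address"))
```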

Document/Full-Text. Pros: Collection of documents which contain key-value collections. Natural data modeling. Programmer friendly. Web based. Mostly REST/JSON friendly.

Document/Full-text search databases (diagram: keys mapped to values). Example: "Person": { "name": "Chander Dhall", "address": { "city": "los angeles", "state": "CA", "zip": "90069" } }

Graph databases (diagram: four keys)

Step 12: Finalizing Caching (diagram: Load Balancer, Load Balancer, Hash Map, Offline Processing, two DB Clusters each with a master and slaves, NoSql, SAN, Search DB)

NoSql Paradigm: Denormalization. Data duplication and denormalization are first-class citizens. Increases total data volume. Simplifies query processing, esp. in a distributed environment.

NoSql Paradigm: Atomic Aggregates. Relational tables: Checking (Id, Min bal); Account (Id, Account No.); Savings (Id, Interest rate). As aggregates: Account { "Type": "Checking", "Id": "chk123", "Min Bal": 10000 } and Account { "Type": "Savings", "Id": "sav123", "Interest Rate": "5%" }.

NoSql Paradigm: No joins. SQL joins happen at query time, hence a performance penalty. Handled in the application instead.
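
"Handled in the application" means the code fetches from two collections and stitches the results together itself. A small sketch with made-up user and order records (the field names and `orders_with_user_names` are hypothetical):

```python
users = {
    "U123": {"name": "Chander"},
    "U124": {"name": "Katie"},
}
orders = [
    {"order_id": "O111", "user_id": "U123", "products": ["surface", "xbox"]},
    {"order_id": "O123", "user_id": "U124", "products": ["win 8", "xbox"]},
]


def orders_with_user_names(orders, users):
    """Application-side join: one key lookup per order instead of a SQL JOIN."""
    return [
        {**order, "user_name": users[order["user_id"]]["name"]}
        for order in orders
    ]


joined = orders_with_user_names(orders, users)
print(joined[0]["order_id"], joined[0]["user_name"])
```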

NoSql Paradigm: Enumerable keys. Sequential IDs for composite keys, e.g. DeptId_employeeId. Group into buckets sorted by timestamp, day, week, etc.

NoSql Paradigm: Index table

Employee table
Employee Id | Details
1234 | Email: a@a.com; State: CA; Dept: IT
8235 | Email: b@b.com; State: TX; Dept: Sales
2234 | Email: c@c.com; State: AL; Dept: IT
1671 | Email: c@d.com; State: WA; Dept: Sales

Index by State
State | Employee Id
CA | 1234, 1235, 1236, 1244
TX | 8000, 8100, 8235, 8266
AL | 2212, 2221, 2234, 2256

Index by Dept
Dept | Employee Id
IT | 1234, 1235, 1236, 1244
Sales | 8000, 8100, 8235, 8266
Acc | 2212, 2221, 2234, 2256
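
An index table is a second collection, maintained by the application, that maps a field value back to the IDs of the records holding it. The sketch below builds such tables from the four employee rows shown on the slide; `build_index` is a hypothetical helper, and the result covers only those four rows, so it is smaller than the slide's index tables.

```python
from collections import defaultdict

employees = {
    1234: {"email": "a@a.com", "state": "CA", "dept": "IT"},
    8235: {"email": "b@b.com", "state": "TX", "dept": "Sales"},
    2234: {"email": "c@c.com", "state": "AL", "dept": "IT"},
    1671: {"email": "c@d.com", "state": "WA", "dept": "Sales"},
}


def build_index(records, field):
    """Map each value of `field` to the sorted IDs of the records that have it."""
    index = defaultdict(list)
    for record_id, details in records.items():
        index[details[field]].append(record_id)
    return {value: sorted(ids) for value, ids in index.items()}


by_state = build_index(employees, "state")
by_dept = build_index(employees, "dept")

# Query by department without scanning the employee table.
it_emails = [employees[i]["email"] for i in by_dept["IT"]]
print(by_dept, it_emails)
```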

NoSql Paradigm: Tree Index. Country: USA; State: CA; City: LA. { "property": [{ "facilityname": "abc", "facilityid": 111 }, { "facilityname": "xyz", "facilityid": 222 }] } (diagram labels: Properties, Facilities)

NoSql Paradigm: Composite Key

EMPLOYEES (IT Employees, Sales Employees)
IT:Software:1123 | EmpName: John; Address: Los Angeles
IT:Software:2323 | EmpName: Kevin; Address: Dallas, TX
IT:Hardware:6767 | EmpName: Matt; Address: San Francisco
Sales:Online:832 | EmpName: Katie; Address: Austin, TX
Sales:Online:423 | EmpName: Karen; Address: Irvine, CA
Sales:Store:556 | EmpName: Richard; Address: San Diego

Query: Dept = IT:* or Dept = Sales:Online:*
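
With composite keys, the wildcard queries on the slide become prefix matches on the key. The sketch below assumes the keys and names pair up in the order they appear on the slide, and `by_prefix` is a hypothetical helper; it filters a dictionary linearly, where an ordered store would run a range scan over the same prefix.

```python
employees = {
    "IT:Software:1123": "John",
    "IT:Software:2323": "Kevin",
    "IT:Hardware:6767": "Matt",
    "Sales:Online:832": "Katie",
    "Sales:Online:423": "Karen",
    "Sales:Store:556": "Richard",
}


def by_prefix(store, prefix):
    """Return the entries whose composite key starts with `prefix`, in key order."""
    return {key: store[key] for key in sorted(store) if key.startswith(prefix)}


it_staff = by_prefix(employees, "IT:")                 # Dept = IT:*
online_sales = by_prefix(employees, "Sales:Online:")   # Dept = Sales:Online:*
print(sorted(it_staff.values()), sorted(online_sales.values()))
```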

NoSql Paradigm: Grouping

U123:O111 | Product Ids: [ Surface, xbox ]
U124:O123 | Product Ids: [ Win 8, xbox ]
U124:O234 | Product Ids: [ Win phone, surface ]
U124:O999 | Product Ids: [ office, azure sub ]
U125:O789 | Product Ids: [ msdn, office ]
U125:O945 | Product Ids: [ surface, xbox ]

Colocation of a user's data.

Inverted search & direct aggregation. EmpId, dept, city, ... Dept-IT: [111, 123, 234, ...]. Dept-Sales: [673, 343, 434, ...]. 111: Dept-Sales, City: LA. 222: Dept-IT, City: Dallas. City: Dallas. City: LA.

NoSql Paradigm: Materialized paths (diagram: category tree with nodes Electronics, TV, Phones, Computers, Cameras, Samsung, Apple, LG, LCD, LED)

NoSql Paradigm: Materialized paths (diagram: TV, Samsung, Apple, LG, LCD, LED). { "entity": "TV", "category": "Electronics" } { "entity": "Samsung", "category": "Electronics, TV" } { "entity": "Samsung", "category": "Electronics, TV, LCD" }
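
With materialized paths, each document stores the full list of its ancestors, so "everything under X" becomes a prefix comparison. The sketch below reuses the three documents from the slide and assumes the category field is stored as a list rather than a comma-separated string; `under` is a hypothetical helper.

```python
documents = [
    {"entity": "TV", "category": ["Electronics"]},
    {"entity": "Samsung", "category": ["Electronics", "TV"]},
    {"entity": "Samsung", "category": ["Electronics", "TV", "LCD"]},
]


def under(documents, *path):
    """Return the documents whose stored category path starts with `path`."""
    prefix = list(path)
    return [d for d in documents if d["category"][:len(prefix)] == prefix]


all_electronics = under(documents, "Electronics")
tv_items = under(documents, "Electronics", "TV")
print(len(all_electronics), len(tv_items))
```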

NoSql Paradigm: Nested sets (diagram: tree with nodes Electronics, TV, Phones, Samsung, Sony, Cell, Landline; boundaries numbered 1 to 14)
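
In the nested sets model, each node is given a left and a right number by a depth-first walk, and a node's descendants are exactly the nodes whose numbers fall inside its own pair. The sketch below assumes the tree shape Electronics > (TV > (Samsung, Sony), Phones > (Cell, Landline)), which is a reading of the diagram; `number_tree` and `descendants` are hypothetical helpers.

```python
tree = {   # assumed shape of the diagram: parent -> children
    "Electronics": ["TV", "Phones"],
    "TV": ["Samsung", "Sony"],
    "Phones": ["Cell", "Landline"],
}


def number_tree(tree, root):
    """Assign (left, right) numbers to every node with a depth-first walk."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in tree.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)

    visit(root)
    return bounds


def descendants(bounds, node):
    """Nodes whose interval lies strictly inside the interval of `node`."""
    left, right = bounds[node]
    inside = [n for n, (l, r) in bounds.items() if left < l and r < right]
    return sorted(inside, key=lambda n: bounds[n][0])


bounds = number_tree(tree, "Electronics")
print(bounds["Electronics"], bounds["TV"], descendants(bounds, "TV"))
```

The trade-off is that reads of a whole subtree are a single range comparison, while inserting a node forces renumbering of everything to its right.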

Flattening nested documents. Name: Chander; Hadoop: Expert; Nodejs: Expert; Spanish: Novice. { "name": "chander", "skills": ["hadoop", "nodejs", "spanish"], "level": ["expert", "expert", "novice"] }. Query: skills:hadoop AND level:expert

Flattening nested documents. Name: Chander; Hadoop: Expert; Nodejs: Expert; Spanish: Novice. { "name": "chander", "skills_1": "hadoop", "skills_2": "nodejs", "skills_3": "spanish", "level_1": "expert", "level_2": "expert", "level_3": "novice" }
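
The reason for flattening is that a query over two parallel arrays cannot tell which skill goes with which level. The sketch below shows the false positive on the nested form and how the numbered fields keep each pair together; `naive_match`, `flatten`, and `flat_match` are hypothetical helpers written for this illustration.

```python
nested = {
    "name": "chander",
    "skills": ["hadoop", "nodejs", "spanish"],
    "level": ["expert", "expert", "novice"],
}


def naive_match(doc, skill, level):
    """skills:<skill> AND level:<level> on the nested form (pairs are lost)."""
    return skill in doc["skills"] and level in doc["level"]


def flatten(doc):
    """Turn the parallel arrays into numbered skills_N / level_N fields."""
    flat = {"name": doc["name"]}
    for i, (skill, level) in enumerate(zip(doc["skills"], doc["level"]), start=1):
        flat[f"skills_{i}"] = skill
        flat[f"level_{i}"] = level
    return flat


def flat_match(flat, skill, level):
    """Match only when the same position N holds both the skill and the level."""
    i = 1
    while f"skills_{i}" in flat:
        if flat[f"skills_{i}"] == skill and flat[f"level_{i}"] == level:
            return True
        i += 1
    return False


flat = flatten(nested)
print(naive_match(nested, "spanish", "expert"))  # True: a false positive
print(flat_match(flat, "spanish", "expert"))     # False: pairs are kept together
print(flat_match(flat, "hadoop", "expert"))      # True
```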

References: http://www.couchbase.com/why-nosql/nosql-database; Highly Scalable Blog; www.10gen.com; http://couchdb.apache.org/; www.ravendb.net

References: http://redis.io/; http://neo4j.org/; http://cassandra.apache.org; http://elasticsearch.org; http://memcached.org/; Building Scalable Architecture