BDT312: Using the Cloud to Scale from a Database to a Data Platform
Ryan Horn, Lead Software Engineer at Twilio
November 12, 2014, Las Vegas
© 2014, Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Hi, I'm Ryan, Tech Lead of the User Data team at Twilio.
What is Twilio? We provide a communications API that enables phones, VoIP, and messaging to be embedded into web, desktop, and mobile software.
How Does It Work? A user calls your number. Twilio receives the call. Your app responds.
What is the User Data Team? We scale Twilio's backend database infrastructure. We build customer-facing data APIs. We manage data policies and the security of data at rest.
Databases at Twilio
Calls and Messages Are Stateful. Call lifecycle: Queued → Ringing → In Progress → Completed. Message lifecycle: Queued → Sending → Sent → Delivered.
In the Beginning All data was placed in the same physical database regardless of where the call or message was in its lifecycle.
The Monolithic Database Model (diagram): the API, Web, Carriers, and Billing consumers all go through the Call/Message Service to a single MySQL database.
Problems at Scale Many consumers of data Data with different performance characteristics Failure in the database degrades many services Horizontal scaling and orchestration is complicated
Moving to a Service-Oriented Architecture
What is a Service-Oriented Architecture? An architecture in which required system behavior is decomposed into discrete units of functionality, implemented as individual services for applications to compose and consume.
Communicate Through Interfaces, Not Databases (diagram): the API, Web, Carriers, and Billing consumers talk to the Call/Message Service, which composes an In Flight Service (backed by In Flight MySQL) and a Post Flight Service (backed by Post Flight MySQL).
Database Can Change Without Changing Every Service (diagram): same topology, but the Post Flight Service's backing store has been swapped from MySQL to Amazon DynamoDB without touching any other service.
SOA Doesn't Solve Everything. No matter how many services you put in front of MySQL, it's still a single point of failure.
Sharding MySQL
Implementing Sharding (the easy part) 1. Choose partitioning scheme 2. Implement routing logic 3. Send application queries through router 4. Go!
Sharding at Twilio (diagram): Application → Router → Shard0 (keys 0-3), Shard1 (keys 3-6), Shard2 (keys 6-9).
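The router in the diagram above can be sketched as a stable hash into a small keyspace plus a range lookup. A minimal sketch: the 0-9 keyspace and shard names come from the slides, while the hash function and everything else are hypothetical.

```python
import zlib

# Keyspace and ranges follow the diagram, read as half-open ranges:
# keys 0-2 land on shard0, 3-5 on shard1, and 6-9 on shard2.
SHARDS = {
    "shard0": range(0, 3),
    "shard1": range(3, 6),
    "shard2": range(6, 10),
}

def route(key: str) -> str:
    """Map a key to its shard via a stable hash into the 0-9 keyspace."""
    slot = zlib.crc32(key.encode()) % 10
    for shard, slots in SHARDS.items():
        if slot in slots:
            return shard
    raise KeyError(slot)  # unreachable: the ranges cover the keyspace
```

Application queries then go through route() (step 3 of the list above), so the application never needs to know which physical database holds a key.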
Rolling it Out With Zero Downtime (the hard part). We provide a 24/7, always-on service. Communications is intolerant of inconsistency and latency. There is no maintenance window.
Bringing Up a New Shard (diagram): the Application writes the full 0-9 keyspace to Master1, with Slave1 replicating from it; the new pair, Master2 and Slave2, is brought up replicating from the existing cluster so it holds a complete copy of the data.
Split Odds and Evens for Writes (diagram): the Application splits writes, sending odds to Master1 (which still covers the full 0-9 keyspace) and evens to Master2, while Slave1 and Slave2 continue replicating.
Update Routing (diagram): the Application's routing table is updated so keys 0-4 go to Master1 and keys 5-9 go to Master2.
Cut Slave Link (diagram): the replication link between the two master/slave pairs is severed, leaving Master1/Slave1 serving keys 0-4 and Master2/Slave2 serving keys 5-9.
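The cutover above works because the routing table can be swapped while both masters still hold a full copy of the data; only after routing is updated does the replication link get cut. A minimal sketch of such a swappable routing table, with hypothetical names:

```python
import threading

class Router:
    """Shard routing table that can be replaced atomically mid-split."""

    def __init__(self, table):
        self._lock = threading.Lock()
        self._table = table  # list of (key_range, shard) pairs

    def route(self, slot: int) -> str:
        with self._lock:
            for key_range, shard in self._table:
                if slot in key_range:
                    return shard
        raise KeyError(slot)

    def update(self, table):
        with self._lock:
            self._table = table

# Before the split: Master1 owns the whole 0-9 keyspace.
router = Router([(range(0, 10), "master1")])

# "Update Routing" step: Master2 takes over keys 5-9. In-flight reads
# against Master1 stay consistent because the slave link is cut only
# after the new table is in place.
router.update([(range(0, 5), "master1"), (range(5, 10), "master2")])
```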
New Solutions, New Problems
A Necessary Burden In the beginning, the burden of managing our own databases was non-negotiable.
The Landscape has Changed We now have a variety of managed database services which solve these problems for us, such as Amazon RDS, Amazon DynamoDB, Amazon SimpleDB, Amazon Redshift, etc.
Cost Is Never Optimized Application developers do not (and should not) optimize for database cost.
Self-Managed Databases are Costly. Databases: 78% of cost; everything else: 22%. Source: Twilio data usage.
Keeping up With Growth As growth continues to accelerate, we need to somehow keep up.
A Change in Approach Change our hiring practices and bring in specialists Remove the context switching
Focusing on What We Do Well
Adopting Amazon DynamoDB
Thinking in Terms of Throughput Amazon DynamoDB allows us to scale in terms of throughput, not machines. This is the future of resource provisioning.
Operations. Management and scaling of our cluster are fully abstracted away from us.
Cost Compared to MySQL. Amazon DynamoDB: 18%; MySQL: 82%. Source: Twilio data usage.
Cost with MySQL Fully Replaced. Databases: 39% of cost; everything else: 61%. Source: Twilio data usage.
A Relational Model with Amazon DynamoDB Many of our services allow for querying data in a way that maps naturally to a relational database.
GET /Accounts/2/Events
SELECT * FROM events ORDER BY date DESC;
GET /Accounts/2/Events?IpAddress=5.6.7.8
SELECT * FROM events WHERE IpAddress='5.6.7.8' ORDER BY date DESC;
GET /Accounts/2/Events?IpAddress=5.6.7.8&Date<=2014-10-03
SELECT * FROM events WHERE IpAddress='5.6.7.8' AND Date<='2014-10-03' ORDER BY date DESC;
GET /Accounts/2/Events
Query: AccountId=2, ScanIndexForward=false

AccountId (Hash) | Date (Range) | IpAddress_Date     | Type
2                | 2014-10-03   | 5.6.7.8 2014-10-03 | call
2                | 2014-10-01   | 5.6.7.8 2014-10-01 | message
GET /Accounts/2/Events?IpAddress=5.6.7.8
Query: AccountId=2, IpAddress_Date begins with 5.6.7.8, ScanIndexForward=false

AccountId (Hash) | IpAddress_Date (Range) | Date       | Type
2                | 5.6.7.8 2014-10-03     | 2014-10-03 | call
2                | 5.6.7.8 2014-10-01     | 2014-10-01 | message
GET /Accounts/2/Events?IpAddress=5.6.7.8&Date<=2014-10-03
Query: AccountId=2, IpAddress_Date LT 5.6.7.8 2014-10-03, ScanIndexForward=false

AccountId (Hash) | IpAddress_Date (Range) | Date       | Type
2                | 5.6.7.8 2014-10-03     | 2014-10-03 | call
2                | 5.6.7.8 2014-10-01     | 2014-10-01 | message
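The trick in the tables above is that the composite IpAddress_Date string sorts the same way the (IpAddress, Date) pair does, because ISO dates sort correctly as text, so each API call reduces to a single range-key condition. A plain-Python sketch of that reduction, with no AWS dependency; the data is taken from the example tables:

```python
# Items as stored in the index: hash key AccountId, composite range key.
ITEMS = [
    {"AccountId": "2", "IpAddress_Date": "5.6.7.8 2014-10-03", "Type": "call"},
    {"AccountId": "2", "IpAddress_Date": "5.6.7.8 2014-10-01", "Type": "message"},
]

def query(account_id, begins_with=None, lt=None, scan_forward=True):
    """Emulate a DynamoDB Query over the composite range key."""
    hits = [i for i in ITEMS if i["AccountId"] == account_id]
    if begins_with is not None:  # GET ...?IpAddress=5.6.7.8
        hits = [i for i in hits if i["IpAddress_Date"].startswith(begins_with)]
    if lt is not None:           # GET ...?IpAddress=...&Date<=...
        hits = [i for i in hits if i["IpAddress_Date"] < lt]
    return sorted(hits, key=lambda i: i["IpAddress_Date"],
                  reverse=not scan_forward)

# Newest-first listing for one IP, mirroring the second table above:
events = query("2", begins_with="5.6.7.8", scan_forward=False)
```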
Need to Handle Exceeded Throughput Failures Exceeding provisioned throughput is a runtime error.
Handling Exceeded Write Throughput with Amazon SQS. Queuing events to Amazon SQS and processing them asynchronously allows us to gracefully deal with write-throughput errors.
(diagram) The API, Web, and Billing services publish events to Amazon SQS; an Events Processor consumes the queue and writes them to Amazon DynamoDB.
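The processor side of this pattern can be sketched as: take an event from the queue, attempt the DynamoDB write, and on a throughput-exceeded error back off and retry, falling back to queue redelivery if the retries run out. A hypothetical sketch; the real processor would use the AWS SDK and DynamoDB's actual exception type.

```python
import time

class ThroughputExceeded(Exception):
    """Stand-in for DynamoDB's ProvisionedThroughputExceededException."""

def process(message, write, max_attempts=5):
    """Write one queued event to the store, backing off exponentially
    when provisioned throughput is exceeded.

    Returns True on success (caller deletes the SQS message) and False
    when attempts are exhausted (message stays queued for redelivery)."""
    for attempt in range(max_attempts):
        try:
            write(message)
            return True
        except ThroughputExceeded:
            time.sleep(min(2 ** attempt * 0.05, 1.0))  # add jitter in real use
    return False
```

Because SQS redelivers messages whose visibility timeout expires, returning False here is safe: the event is retried later rather than lost, which is what makes throughput errors gracefully absorbable.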
Maximum of 5 Global and 5 Local Indexes You can manage your own indexes, but your application must then handle partial mutation failures.
Local Index Size Limits Local secondary indexes provide immediate consistency and limit the data set for a given hash key to 10GB.
Data Warehouse
Brief History, 2008-2011: all business intelligence queries ran on replicas of the MySQL clusters serving production traffic.
Brief History, 2011-2013: data was pushed to Amazon S3 and queried with Pig on Amazon EMR, improving our ability to aggregate, but with high latency.
Brief History, 2013-Present: moving to Amazon Redshift cut these reports from hours to seconds, allowing us to answer critical BI and financial questions in near real time.
Pushing Data Into Amazon Redshift (diagram): the Post Flight Service publishes to Kafka; a Loader batches events to Amazon S3; a Warehouse Loader copies from S3 into Amazon Redshift, with an SQS dead-letter queue (DLQ) catching failed loads.
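The S3-to-Redshift hop in this pipeline is what makes it fast: Redshift bulk-loads S3 files with a COPY statement rather than row-by-row inserts. A hypothetical sketch of the statement the Warehouse Loader might issue; table, bucket, and role names are made up, and IAM_ROLE authentication is a later AWS addition (the 2014-era loader would have used access keys).

```python
def copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for one batch of S3 files.

    The real loader would execute this over a database connection;
    here we only construct the SQL."""
    return (
        f"COPY {table} "
        f"FROM 's3://{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "GZIP DELIMITER '\\t' TIMEFORMAT 'auto';"
    )

print(copy_statement(
    "events",
    "warehouse-bucket/events/2014/10/03/",
    "arn:aws:iam::123456789012:role/redshift-loader",
))
```

COPY loads every file under the given prefix in parallel across the cluster's slices, which is why batching to S3 first beats streaming individual rows into the warehouse.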
Wrapping Up
Managed Services as a Culture. Our focus on creating an experience that unifies and simplifies communications is reflected in our adoption of managed services.
Managed Services as a Culture Understanding and focusing on our areas of expertise and leveraging managed services for the rest accelerates the delivery of value and innovation to our customers.
Thank You!
http://bit.ly/awsevals