MOVING THE ELEPHANT IN THE ROOM
Data Migration at Scale
WHO AM I?
- BDPA Los Angeles Chapter
- 4-year HSCC participant
- Columbia University, CC '14
- Conductor, Inc.
- linkedin.com/in/calltyrone
CONDUCTOR, INC.
- Web Presence Management SaaS
- Big data
- Collect 6 TB of raw web data a week
- Scalable collection & ETL pipelines
- Final product: reports
- 6 years running
- Tons of data!
WHY WE CARE ABOUT SCALABILITY
- More users
- More data
- Systems have to keep up!
SCALABILITY IN THE REAL WORLD
- Yesterday's solution is tomorrow's problem
- Under-prioritized
- It's hard!
- Can require massive changes
- No cure-all
WHY REPLACE AN UNSCALABLE SYSTEM?
- Save money
- Improve performance
- Improve reliability
- Clear the way for progress
WHY NOT?
- If it ain't broke
- Significant resource investment
  - Time
  - Money
- Software downtime
- Data quality concerns
BUT IT'S SO SIMPLE!
- Identify an unscalable system
- Discover and vet a suitable successor
- Replace the legacy system with the new system, while minimizing risk and cost
TALKING ABOUT THE ELEPHANT
Identifying an Unscalable System
CASE STUDY: LEGACY REPORTING DATABASE
Overview
- MySQL
- Multi-dimensional report data stored in a normalized manner across many tables
- Helpful for initial modeling of our problem space
- Hosted by a single, very powerful machine
Talking about the Elephant: Diagnosing an Unscalable System
CASE STUDY: LEGACY REPORTING DATABASE
Unsustainable
- Powerful EC2 hardware isn't cheap.
- Vertical scaling
  - Capacity issues? Get a bigger machine.
- Obsolete schema
- Difficult to back up
- Queries aren't getting any faster.
SEE FOR YOURSELF
If your solution:
- Scales vertically
- Prevents progress
- Can't perform at scale
- Is difficult/slow/expensive to upgrade
It's time for a change!
FINDING A BIGGER ROOM
Vetting Scalable Alternatives
WHAT TO LOOK FOR
- Price-efficient
- Ease of maintenance
- Horizontal scaling
Finding a Bigger Room: Vetting Scalable Alternatives
CASE STUDY: AWS S3 DATASTORE
Our Use Case
- Write once, read many
- De-normalized reports
- High storage capacity
- High availability
CASE STUDY: AWS S3 DATASTORE
Technical Overview
- Poor write performance, great read performance
- Flat files
- No defined space limit
- Configurable file replication
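As an illustration of the "de-normalized reports" and "flat files" points, here is a minimal sketch of joining normalized rows into self-contained report records and serializing them once as a flat file. The table names, fields, and sample values are hypothetical and are not taken from Conductor's schema.

```python
import json

# Hypothetical normalized rows, standing in for tables in a legacy database.
accounts = {1: {"name": "Example Co"}}
keywords = {10: {"account_id": 1, "text": "data migration"}}
rankings = [
    {"keyword_id": 10, "period": "2014-W01", "rank": 3},
    {"keyword_id": 10, "period": "2014-W02", "rank": 2},
]

def denormalize(rankings, keywords, accounts):
    """Join the normalized rows into self-contained report records."""
    records = []
    for row in rankings:
        keyword = keywords[row["keyword_id"]]
        account = accounts[keyword["account_id"]]
        records.append({
            "account": account["name"],
            "keyword": keyword["text"],
            "period": row["period"],
            "rank": row["rank"],
        })
    return records

def to_flat_file(records):
    """Serialize records as newline-delimited JSON, one report row per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# The resulting text could be written once to an object store and read many times.
flat = to_flat_file(denormalize(rankings, keywords, accounts))
```

Each line of the output carries everything a reader needs, so no joins are required at read time. That is the trade made for the more expensive write.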
CASE STUDY: AWS S3 DATASTORE
Benefits
- Cheap
- Elastic
- Architecture facilitates testing
- Easy to back up
CASE STUDY: AWS S3 DATASTORE
Caveats
- Eventual consistency
- Switching to a non-relational solution is nontrivial
- Application code must change
- Migration path gets complicated
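One way application code might cope with the eventual consistency caveat is to retry a read until the written object becomes visible. The sketch below assumes a hypothetical `fetch` callable that returns `None` while an object is not yet readable. `LaggyStore` is an in-memory stand-in for illustration only, not a real S3 client.

```python
import time

def read_with_retry(fetch, key, attempts=5, delay_seconds=0.0):
    """Call fetch(key) until it returns a value or attempts run out.

    fetch is a hypothetical callable that returns None while the object
    is not yet visible to readers.
    """
    for attempt in range(attempts):
        value = fetch(key)
        if value is not None:
            return value
        time.sleep(delay_seconds * (2 ** attempt))  # back off between reads
    raise LookupError(f"{key!r} not visible after {attempts} attempts")

class LaggyStore:
    """In-memory stand-in for a store whose writes become visible late."""

    def __init__(self, lag_reads):
        self._data = {}
        self._lag = lag_reads  # number of reads that miss before data appears

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        if self._lag > 0:
            self._lag -= 1
            return None
        return self._data.get(key)

store = LaggyStore(lag_reads=2)
store.put("reports/2014-W01.json", b"{}")
result = read_with_retry(store.get, "reports/2014-W01.json")
```

The retry count and delay are knobs to tune against the store's observed behavior. The defaults here are placeholders.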
MOVING THE ELEPHANT
Migrating Legacy Data to the New System
INITIAL CONSIDERATIONS
- Time frame
- Scheduling constraints
- Operational cost
- Resource constraints
- Standards for data parity
Moving the Elephant: Migrating Legacy Data to the New System
CASE STUDY: OUR UPFRONT PLANNING
- Two-month finish line
- Developed COGS models
- Built data validation software
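The talk does not describe how the data validation software worked. As a hedged sketch of one common approach to checking data parity, the code below compares a row count and an order-independent content digest between the legacy and migrated data sets. The function names and sample records are hypothetical.

```python
import hashlib
import json

def fingerprint(records):
    """Return (row_count, digest) for a collection of dict records.

    Each record is hashed in a canonical JSON form and the sorted
    hashes are combined, so row order does not affect the result.
    """
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    combined = hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
    return len(row_hashes), combined

def check_parity(legacy_records, migrated_records):
    """Return a list of human-readable parity problems (empty means parity)."""
    problems = []
    legacy_count, legacy_digest = fingerprint(legacy_records)
    new_count, new_digest = fingerprint(migrated_records)
    if legacy_count != new_count:
        problems.append(f"row count differs: {legacy_count} vs {new_count}")
    if legacy_digest != new_digest:
        problems.append("content digest differs")
    return problems

legacy = [{"id": 1, "rank": 3}, {"id": 2, "rank": 7}]
migrated_ok = [{"rank": 7, "id": 2}, {"rank": 3, "id": 1}]  # same data, new order
migrated_bad = [{"id": 1, "rank": 3}, {"id": 2, "rank": 8}]

ok_problems = check_parity(legacy, migrated_ok)
bad_problems = check_parity(legacy, migrated_bad)
```

A digest only says that something differs, not what. A real validator would likely drill down per partition or per record once a mismatch is found.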
IDEAL MIGRATION SOFTWARE CHARACTERISTICS
- Can be scaled up or down
  - Speed up to save time
  - Slow down to save resources
- Can be run in a testing capacity
- Configurable data sources/sinks
- Configurable hardware resource use
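These characteristics can be sketched as a small configuration object: the source and sink are pluggable, a batch size acts as a simple resource knob, and a dry-run flag allows the job to run in a testing capacity. All names here (`MigrationConfig`, `run_migration`, `batch_size`, `dry_run`) are hypothetical and are not drawn from the tooling described in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class MigrationConfig:
    read_source: Callable[[], Iterable[dict]]  # configurable data source
    write_sink: Callable[[dict], None]         # configurable data sink
    batch_size: int = 100   # larger batches: faster, but more memory per step
    dry_run: bool = False   # testing mode: read and count, never write

def _flush(batch, config):
    """Write one batch to the sink unless this is a dry run."""
    if not config.dry_run:
        for record in batch:
            config.write_sink(record)
    return len(batch)

def run_migration(config):
    """Copy records from source to sink in batches; return records handled."""
    handled = 0
    batch = []
    for record in config.read_source():
        batch.append(record)
        if len(batch) >= config.batch_size:
            handled += _flush(batch, config)
            batch = []
    handled += _flush(batch, config)  # flush the final partial batch
    return handled

def sample_source():
    return [{"id": i} for i in range(5)]

sink = []
moved = run_migration(MigrationConfig(sample_source, sink.append, batch_size=2))

test_sink = []
counted = run_migration(
    MigrationConfig(sample_source, test_sink.append, batch_size=2, dry_run=True)
)
```

Because the source and sink are plain callables, the same job can point at a QA data set or a production one without code changes.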
OUR MIGRATION SOFTWARE
- Oozie and Hive
- Controllable time/resource tradeoff
- Testable in a QA environment
AN INCREMENTAL MIGRATION: PARTITIONING DATA
- Easy to track progress
- Enables concurrency
- Dilutes failure risks
- E.g. Conductor Time Periods
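The three benefits of partitioning can be seen in a minimal sketch that migrates each partition (for example, one time period) as an independent task. It is written in plain Python rather than Oozie and Hive. `migrate_one` is a hypothetical callable, and the period labels are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def migrate_partitions(partitions, migrate_one, max_workers=4):
    """Migrate each partition independently and report the outcome.

    migrate_one is a hypothetical callable that moves one partition
    (for example, one time period) and raises on failure.
    """
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(migrate_one, p): p for p in partitions}
        for future in as_completed(futures):
            partition = futures[future]
            try:
                future.result()
                done.append(partition)
            except Exception as exc:  # one bad partition does not stop the rest
                failed[partition] = str(exc)
    return sorted(done), failed

def fake_migrate(period):
    """Stand-in migration step that fails for one period."""
    if period == "2014-W02":
        raise RuntimeError("simulated failure")

periods = ["2014-W01", "2014-W02", "2014-W03"]
done, failed = migrate_partitions(periods, fake_migrate, max_workers=2)
```

The `done` list gives a progress measure, `max_workers` controls concurrency, and a failure is contained to the one partition that needs a rerun.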
AN INCREMENTAL RELEASE
- Limit client exposure to bugs
- Crowd-source intensive QA
- Incorporate customer feedback
- Demonstrate progress early
- E.g. Conductor Searchlight 3.0 Beta Program
  - Got customers excited
  - Helped to find bugs
YOU CAN DO IT!
Thanks for Listening!
QUESTIONS?
(We're Hiring!)