Predictive Analytics with Storm, Hadoop, and R on AWS
Douglas Moore, Principal Consultant & Architect
February 2013
Leading provider of data science and engineering services
Accelerating your time to value using big data
- IMAGINE: Strategy and Roadmap
- ILLUMINATE: Training and Education
- IMPLEMENT: Hands-On Data Science and Data Engineering
Boston Storm Meetup, 2013-02-28
Agenda
- Intro
- Project Information
- Predictive Analytics
- Storm Overview
- Architecture & Design
- Deployment
- Lessons
- Best Practices
- Future
- Bonus: Storm & Big Data Patterns
Project Definition
AdGlue: solving the biggest problem for local advertisers: "Where's my ad?"
Their needs:
- Scale up for new business deals
- A more lively site
- Better predictions
- Recommendations
Use cases:
- Scale the batch analysis pipeline; generate timely stats
- Recommendations
- Predictions: how many page views in the next 30 days?
Environment:
- AWS
- Version 1 of the site & analytics in production
Project plan:
- 8-9 weeks
- Combined data engineering + data science engagement
- Staffing: 1 architect + 1 PM, 1 data engineer, 2 data scientists, 3 client engineers
Predictive Analytics Process
Model design & build:
- Listening & learning
- Discovery (digging through the data)
- Creating a research agenda
- Testing & learning
Production predictive model development:
- Data cleansing, aggregation, and conditioning
- Predictive model training process
- Predictive model execution process
Challenges:
- What functional forms predict future impression counts, given counts up to time T?
- Robust estimators, such as medians rather than means, to cope with outliers (see the sketch after this list)
- How do we distinguish new articles from old articles we are seeing for the first time?
- How well do impression counts correspond to real humans?
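To make the robustness point concrete, here is a toy sketch (not from the deck; the numbers are invented) showing how a single bot-inflated hour skews a mean while leaving a median untouched:

```java
import java.util.Arrays;

// Toy illustration of the median-vs-mean point: one outlier hour
// (e.g., a bot burst) drags the mean far more than the median.
public final class RobustStats {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] hourlyImpressions = {120, 135, 128, 131, 9500}; // last hour: bot burst
        System.out.println(mean(hourlyImpressions));   // 2002.8 -- badly skewed
        System.out.println(median(hourlyImpressions)); // 131.0  -- unaffected
    }
}
```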
Solution Approach
Analyze a massive historical data set, analyze the recent past, and combine the two for near-realtime prediction:
- Massive historical set = S3
- Analyze (batch) = Hadoop + Pig + R
- Recent past = Storm + NoSQL
- Analyze (realtime) = R + web service
Storm Overview
DAG processing of never-ending streams of data
- Open sourced: https://github.com/nathanmarz/storm/wiki
- Used at Twitter plus more than 24 other companies
- Reliable: at-least-once semantics
- Think MapReduce for data streams
- Java/Clojure based; bolts in Java, plus shell bolts (see the bolt sketch below)
- Not a queue, but usually reads from a queue
Related: S4, CEP
Compromises:
- Static topologies & cluster sizing avoid messy dynamic rebalancing
- Nimbus is a single point of failure (SPOF)
- Strong community support, but no commercial support
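As a minimal illustration of a Java bolt in the Storm 0.8-era (backtype.storm) API, here is a hypothetical bolt, not from the deck, that splits a raw log line into words:

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt: emits one tuple per word in an incoming line.
public class SplitLineBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // BaseBasicBolt acks the input tuple automatically when execute() returns.
        for (String word : input.getStringByField("line").split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```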
Storm Concepts Review
- Cluster: Supervisor, Worker
- Topology
- Streams
- Spout
- Bolt
- Tuple
- Stream groupings: shuffle, fields
- Trident
- DRPC
A short wiring sketch tying these concepts together follows this list.
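A minimal sketch of how these concepts compose, using the SplitLineBolt above; LineSpout and WordCountBolt are hypothetical stand-ins:

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("lines", new LineSpout(), 2);           // spout: source of a stream
builder.setBolt("split", new SplitLineBolt(), 4)         // bolt: a processing step
       .shuffleGrouping("lines");                        // shuffle grouping: round-robin
builder.setBolt("count", new WordCountBolt(), 4)
       .fieldsGrouping("split", new Fields("word"));     // fields grouping: same word -> same task
```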
Why Storm? Why Realtime?
- Needed a better way to manage queue readers and the logic pipeline
- Much better than rolling your own
- Reliable (message guarantees, fault tolerant)
- Multi-node scaling (1M messages/sec on a 10-node cluster)
- It works
- For more reasons: https://github.com/nathanmarz/storm/wiki/rationale
Better end-user experience:
- View an ad, see the counter move
Need to catch fast-moving events:
- Content half-life is measured in hours
Path to additional real-time capabilities:
- Trend analysis, for example to recommend hot articles
- Ability to bolt on additional analytics
Overall Architecture
[Architecture diagram] Ad-serving edge servers behind a load balancer emit impression, ad-view, and interaction events to SQS; Storm consumes the queues, with ElastiCache used for tuple state tracking; logs are archived to S3; EMR (Hadoop) handles cleansing, model training, and recommendations; DynamoDB holds performance counters and impression buckets; RDS (MySQL) holds model parameters for the R model behind a getprediction web service; CloudWatch and SNS provide metrics, alarms, and notifications.
Storm's responsibilities (an illustrative SQS spout sketch follows this list):
- Queue management
- Simple bot filtering
- Real-time bucketization
- Performance counters
- Event logging
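As a hedged illustration of the queue-management piece, a minimal SQS-backed spout; this is not the production spout from the deck, and class and field names are assumptions based on the Storm 0.8-era API and the AWS SDK for Java:

```java
import java.util.List;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;

// Hypothetical SQS spout: emits one tuple per message, deletes on ack,
// and relies on the SQS visibility timeout to redeliver failed messages.
public class SqsEventSpout extends BaseRichSpout {
    private final String queueUrl;
    private transient AmazonSQSClient sqs;          // not serializable: build in open()
    private transient SpoutOutputCollector collector;

    public SqsEventSpout(String queueUrl) { this.queueUrl = queueUrl; }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.sqs = new AmazonSQSClient();           // credentials from the default chain
    }

    @Override
    public void nextTuple() {
        List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
        for (Message m : messages) {
            // Use the receipt handle as the message id so ack() can delete it later.
            collector.emit(new Values(m.getBody()), m.getReceiptHandle());
        }
    }

    @Override
    public void ack(Object msgId) {
        sqs.deleteMessage(new DeleteMessageRequest(queueUrl, (String) msgId));
    }

    @Override
    public void fail(Object msgId) {
        // Do nothing: the message reappears after its visibility timeout.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
```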
Analytics Architecture
[Architecture diagram]
- Batch path: impressions land in S3; EMR (Hadoop) performs impression bucketization, producing impression buckets (batch), and a training step produces predictive model parameters for the R model.
- Realtime path: a Storm topology (S3 adapter, impression spout, simple bot annotator, BucketBolt) produces impression buckets (realtime).
- Serving path: a web request for an impression prediction is answered by the R model, using the buckets and the trained parameters.
Storm Topology (Greatly Simplified)
[Topology diagram] An SQS-fed Event Spout and a Command Spout feed a SimpleBotFilter bolt; filtered events flow both to an S3 Adapter<T> (archival to S3) and to Performance Counters<T>, which feed a DynamoDB Adapter<T> that writes to DynamoDB. A hedged wiring sketch follows.
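A speculative reconstruction of that wiring in the backtype.storm API; apart from SqsEventSpout from the earlier sketch, every component class, constant, and grouping here is a hypothetical stand-in, not the production code:

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// EVENT_QUEUE_URL, COMMAND_QUEUE_URL, ARCHIVE_BUCKET, COUNTER_TABLE: placeholders.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new SqsEventSpout(EVENT_QUEUE_URL), 2);
builder.setSpout("commands", new SqsCommandSpout(COMMAND_QUEUE_URL), 1);
builder.setBolt("bot-filter", new SimpleBotFilterBolt(), 4)
       .shuffleGrouping("events");
builder.setBolt("s3-adapter", new S3AdapterBolt(ARCHIVE_BUCKET), 2)
       .shuffleGrouping("bot-filter");                    // archive raw events to S3
builder.setBolt("counters", new PerformanceCounterBolt(), 4)
       .fieldsGrouping("bot-filter", new Fields("adId")); // same ad -> same task
builder.setBolt("dynamo", new DynamoDbAdapterBolt(COUNTER_TABLE), 2)
       .shuffleGrouping("counters")
       .allGrouping("commands");                          // commands fan out to all tasks
```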
Storm Deployment
storm-deploy project: https://github.com/nathanmarz/storm-deploy/wiki
- Uses the Pallet & jclouds projects to deploy a cluster
- Configured through the conf/clusters.yaml & ~/.pallet/config.clj files (see the illustrative config below)
Pros:
- Quick and easy AWS deployment
- Tip: use Puppet/Chef for production deployment
Cons:
- Requires Leiningen v1.x, with no warning if another version is used
- Project not kept up to date
- Changes & debugging are in Clojure
- Recovering a node is possible but slow
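For orientation, a clusters.yaml along the lines of the storm-deploy wiki's example; the AMI IDs, instance sizes, and counts here are placeholders, not settings from this project:

```yaml
nimbus.image: "us-east-1/ami-XXXXXXXX"      # placeholder AMI
nimbus.hardware: "m1.large"
supervisor.count: 2
supervisor.image: "us-east-1/ami-XXXXXXXX"
supervisor.hardware: "m1.large"
zookeeper.count: 1
zookeeper.image: "us-east-1/ami-XXXXXXXX"
zookeeper.hardware: "m1.large"
```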
Lessons
Easy to develop, hard to debug:
- Timeouts
Storm infinite loop of failures:
- Use Memcached to count the number of tuple failures
At-least-once processing:
- Hadoop-based read-repair job
Performance counters not getting flushed:
- Tick tuples (see the sketch below)
Always ACK
Batching to S3:
- Run a compaction & event de-duplication job in Hadoop
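A minimal sketch of the tick-tuple fix, assuming a hypothetical counter bolt; the flush interval and field names are invented:

```java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical counter bolt that flushes on tick tuples rather than per event,
// so counters cannot sit unflushed when traffic is quiet.
public class CounterFlushBolt extends BaseBasicBolt {
    private transient Map<String, Long> counts;

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every 10 seconds (assumed interval).
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
        return conf;
    }

    private static boolean isTick(Tuple t) {
        return Constants.SYSTEM_COMPONENT_ID.equals(t.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(t.getSourceStreamId());
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (counts == null) counts = new HashMap<String, Long>();
        if (isTick(tuple)) {
            flush(counts);                  // hypothetical: commit counters to the store
            counts.clear();
        } else {
            String key = tuple.getStringByField("bucket");
            Long c = counts.get(key);
            counts.put(key, c == null ? 1L : c + 1L);
        }
    }

    private void flush(Map<String, Long> counts) { /* write to DynamoDB, etc. */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}
```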
Lessons (continued)
Understand your timescales:
- The frequency at which you emit running totals/averages/stats
- The frequency at which you write logs to S3
- The frequency at which you commit to DynamoDB / RDS / ...
Tuning procedures are painful when your topology carries lots of tuples (see the config sketch below):
- TOPOLOGY_MESSAGE_TIMEOUT_SECS
- TOPOLOGY_MAX_SPOUT_PENDING
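Where those two knobs live, as a sketch; the values are illustrative starting points only, since the right settings depend on tuple volume and latency:

```java
import backtype.storm.Config;

Config conf = new Config();
conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 60); // replay tuples not fully acked within 60s
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1000);  // cap un-acked tuples per spout task
```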
Storm Best Practices
Debug and unit test topology application logic in local mode (see the harness sketch below):
- Mock testing
- Multiple environments
- Exception handling & logging
When running distributed:
- Start with a small number of workers and slots, so there are fewer log files to dig through
- Automated deployment
Use metrics:
- Instrument your spouts and bolts
- Needed when scaling, in order to optimize performance
- Helps diagnose problems
- The latest WIP versions of Storm add specialized metrics and improve Nimbus reporting
Use test data that is similar to production data:
- Distribution across the topology is data dependent
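A minimal local-mode harness, assuming the `builder` wiring from the earlier sketches:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.utils.Utils;

Config conf = new Config();
conf.setDebug(true);                        // log every emitted tuple

LocalCluster cluster = new LocalCluster();  // in-process pseudo-cluster for tests
cluster.submitTopology("test-topology", conf, builder.createTopology());
Utils.sleep(10000);                         // let the topology run for ten seconds
cluster.killTopology("test-topology");
cluster.shutdown();
```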
Future Improvements
Exactly-once semantics:
- Trident (see the sketch below)
S3 small file sizes:
- Segment the topology just for S3 persistence
- Incremental S3 uploads (faster, too)
DynamoDB costs:
- Use DRPC to access time series and metrics
Deploy using Chef/Puppet:
- AWS OpsWorks?
Revisit analytical models:
- Compare performance
- Compare with other models: do they perform better?
- Feature analysis
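A hedged sketch of what the Trident move could look like, patterned on Trident's standard counting example; the spout and field names are invented, and the in-memory state would be swapped for a persistent transactional store:

```java
import backtype.storm.tuple.Fields;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

TridentTopology topology = new TridentTopology();
topology.newStream("impressions", new ImpressionSpout()) // hypothetical spout emitting "bucket"
        .groupBy(new Fields("bucket"))
        .persistentAggregate(new MemoryMapState.Factory(), // exactly-once counter state
                             new Count(),
                             new Fields("count"));
```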
Bonus
Storm & Big Data Patterns
[Pattern diagram] Transactional source systems (CRUD events) and fleets of edge servers and devices feed events into Storm, which parses, maps, enriches, filters, and distributes them to downstream consumers: log aggregation, ETL, dimensional counts, indexers, analytics, and subscription services, backed by a DFS system of record and feeding OLAP, fuzzy search, dashboards, and partners.
Questions?