How is a rational (big) data deployment approach like optimizing the generation mix of a power company? John Akred & Stephen O'Sullivan @SVDataScience
John Akred @BigDataAnalysis Puppy Daddy PIF Husband Insider Trading (Investigation) VIX Index SPSS Clementine & Text Analysis for Surveys Condition Monitoring HR Analytics Smart Grid 2
Stephen O'Sullivan @steveos Father of 2 Husband Database Engineer DBA Geek Quartermaster Q 3
Utility Source Mix [Chart: generation by hour, 12:00 AM to 11:00 PM, vertical axis 0 to 180. Regions: Variable Load, Base Load. Sources: Other, Customer Solar, Customer Wind, Utility Solar, Utility Battery, Utility Wind, Nuclear, Fossil] 4
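The base-load versus variable-load split in the chart can be illustrated with a small dispatch sketch. This is a simplification under stated assumptions: the capacities, the dispatch order, and the `dispatch` helper are hypothetical and are not taken from the slide or from any real utility.

```python
from typing import Dict, List, Tuple

# Hypothetical capacities in arbitrary load units (not from the slide).
# Sources are listed in the order they are dispatched; the order itself
# is an illustrative assumption: steady base-load sources come first.
SOURCES: List[Tuple[str, float]] = [
    ("Nuclear", 60.0),
    ("Fossil", 50.0),
    ("Utility Wind", 30.0),
    ("Utility Solar", 30.0),
]


def dispatch(demand: float,
             sources: List[Tuple[str, float]] = SOURCES) -> Dict[str, float]:
    """Fill demand from each source in order, up to that source's capacity."""
    remaining = demand
    mix: Dict[str, float] = {}
    for name, capacity in sources:
        used = min(capacity, remaining)
        mix[name] = used
        remaining -= used
    # Whatever is left could not be served by the listed sources.
    mix["Unserved"] = remaining
    return mix


if __name__ == "__main__":
    # Hypothetical demand at a quiet hour and at a peak hour.
    for hour, demand in [("3:00 AM", 80.0), ("6:00 PM", 160.0)]:
        print(hour, dispatch(demand))
```

At the quiet hour only the first sources are used; at the peak hour the later, variable sources are pulled in, which is the shape the chart suggests.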
PREVIOUSLY WE ASKED: What is the data type? What is the size of the data? What are the indexes? What are the foreign key constraints? NOW WE ASK: Confidential And Proprietary. 5
TOTAL COST OF OWNERSHIP Bare Metal vs. Cloud Smackdown This result debunks the idea that the cloud is not suitable for Hadoop MapReduce workloads given their heavy I/O requirements. Hadoop-as-a-Service provides a better price-performance ratio than the bare-metal counterpart. http://www.accenture.com/sitecollectiondocuments/pdf/accenture-hadoop-deployment-comparison-study.pdf 6
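One simple way to read "price-performance ratio" is the cost of completing one run of a workload. The sketch below assumes that reading; the hourly costs and runtimes are hypothetical placeholders, not figures from the cited study, and `cost_per_job` and `cheapest_option` are illustrative names.

```python
from typing import Dict, Tuple


def cost_per_job(hourly_cluster_cost: float, runtime_hours: float) -> float:
    """Cost of one complete run of the workload on a given deployment."""
    return hourly_cluster_cost * runtime_hours


def cheapest_option(options: Dict[str, Tuple[float, float]]) -> str:
    """Return the deployment name with the lowest cost per job.

    options maps a name to (hourly_cluster_cost, runtime_hours).
    """
    return min(options, key=lambda name: cost_per_job(*options[name]))


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    options = {
        "bare-metal": (30.0, 10.0),
        "hadoop-as-a-service": (25.0, 9.0),
    }
    for name, (rate, hours) in options.items():
        print(name, cost_per_job(rate, hours))
    print("cheapest:", cheapest_option(options))
```

The hourly cost should be a full TCO figure (hardware, staff, power, support) rather than a list price, otherwise the comparison is skewed.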
TUNING MATTERS! We consistently observed that the performance improvement by applying various tuning techniques made a huge impact. Memory space per CPU core impacts the maximum per-task heap space allocation that sessionization requires. [Chart: workloads Sessionization, Recommendation, Document Clustering; series First round of tuning, Second round of tuning; labeled values 21h50m, 9h11m] http://www.accenture.com/sitecollectiondocuments/pdf/accenture-hadoop-deployment-comparison-study.pdf 7
TUNING PARAMETERS Cloud Specific / Cluster-Wide: Memory space per CPU core; Number of map task slots and reduce task slots per TaskTracker node; Number of CPU cores; Storage per node; Task resource allocations 8
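To show how slot counts and node memory interact, here is a rough sketch that derives a per-task heap ceiling for a TaskTracker node. The formula, the 2 GB reserve for the OS and daemons, and the helper names are assumptions for illustration. In MRv1 the slot counts are set with `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`, and the task JVM heap with `mapred.child.java.opts`.

```python
def max_heap_per_task_mb(node_memory_mb: int,
                         map_slots: int,
                         reduce_slots: int,
                         reserved_mb: int = 2048) -> int:
    """Largest heap (MB) each task can get if every slot runs at once.

    reserved_mb is an assumed allowance for the OS and Hadoop daemons.
    """
    slots = map_slots + reduce_slots
    if slots <= 0:
        raise ValueError("at least one task slot is required")
    usable = node_memory_mb - reserved_mb
    if usable <= 0:
        raise ValueError("node memory does not cover the reserved amount")
    return usable // slots


def child_java_opts(heap_mb: int) -> str:
    """Format the heap as a JVM option string, e.g. for mapred.child.java.opts."""
    return f"-Xmx{heap_mb}m"


if __name__ == "__main__":
    # Hypothetical node: 32 GB RAM, 8 map slots, 4 reduce slots.
    heap = max_heap_per_task_mb(32768, map_slots=8, reduce_slots=4)
    print(heap, child_java_opts(heap))
```

Fewer slots per node raise the heap available to each task, which matters for memory-hungry jobs; more slots raise parallelism at the cost of per-task memory.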
Log Collection & Search APPLICATION SERVERS Logs Logs statsd Log Search http http 9
Real-Time Sales Transactions APPLICATION SERVERS DATA CENTER A DATA CENTER B Hue Hue http http 10
Identity Matching EDW Push results to Cassandra 11
Sessionization Clickstream 12
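Sessionization groups a user's clickstream events into visits. A minimal in-memory sketch follows, assuming the common convention that a gap of more than 30 minutes starts a new session; the timeout, the `(user_id, timestamp)` event shape, and the `sessionize` helper are assumptions and not the implementation used in the talk.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter
from typing import Iterable, List, Tuple

Event = Tuple[str, datetime]  # (user_id, timestamp), a hypothetical shape


def sessionize(events: Iterable[Event],
               timeout: timedelta = timedelta(minutes=30)
               ) -> List[Tuple[str, int, datetime]]:
    """Label each event with a per-user session index.

    A new session starts when the gap since the user's previous
    event exceeds the timeout.
    """
    labeled: List[Tuple[str, int, datetime]] = []
    # Sort by user, then by time, so each user's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        session = 0
        previous = None
        for _, ts in user_events:
            if previous is not None and ts - previous > timeout:
                session += 1
            labeled.append((user, session, ts))
            previous = ts
    return labeled


if __name__ == "__main__":
    clicks = [
        ("u1", datetime(2014, 1, 1, 10, 0)),
        ("u2", datetime(2014, 1, 1, 10, 5)),
        ("u1", datetime(2014, 1, 1, 10, 10)),
        ("u1", datetime(2014, 1, 1, 11, 0)),
    ]
    for row in sessionize(clicks):
        print(row)
```

On a cluster the same logic is usually expressed as a group-by-user followed by a sort on timestamp, which is why per-task heap space matters for users with very long histories.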
POLYGLOT PERSISTENCE Horizontal State Operations Supply Chain Low Latency Event Detection Planning Asset Management Vertical Historical High Latency Predictive Modeling 13
METHODOLOGY We iterate to value, answering the most valuable questions as quickly as possible Plan Prove Pilot Production Plan Prove Pilot Production 14
FROM EXPERIMENT TO DEPLOYMENT Pilot Workload Optimize Workload Analyze Production Requirements Determine TCO of Deployment Approaches Provision Deployment 15
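The "Determine TCO of Deployment Approaches" step can be sketched in the spirit of the generation-mix analogy: own enough capacity for the steady load and rent the rest on demand. Everything below is a hypothetical model for illustration: the load profile, the per-node-hour rates, and the `total_cost` and `best_owned_capacity` helpers are assumptions, and a real TCO analysis would include many more cost terms.

```python
from typing import Sequence


def total_cost(load: Sequence[int], owned: int,
               owned_rate: float, rented_rate: float) -> float:
    """Cost of serving an hourly load profile (in nodes).

    Owned nodes are paid for every hour whether used or not;
    demand above the owned capacity is rented by the node-hour.
    """
    fixed = owned * owned_rate * len(load)
    burst = sum(max(0, demand - owned) for demand in load) * rented_rate
    return fixed + burst


def best_owned_capacity(load: Sequence[int],
                        owned_rate: float, rented_rate: float) -> int:
    """Try every owned capacity from 0 to peak demand; return the cheapest."""
    candidates = range(0, max(load) + 1)
    return min(candidates,
               key=lambda owned: total_cost(load, owned, owned_rate, rented_rate))


if __name__ == "__main__":
    # Hypothetical profile: four quiet hours at 4 nodes, two peak hours at 10.
    load = [4, 4, 4, 4, 10, 10]
    best = best_owned_capacity(load, owned_rate=1.0, rented_rate=2.0)
    print("own", best, "nodes, cost", total_cost(load, best, 1.0, 2.0))
    print("all rented:", total_cost(load, 0, 1.0, 2.0))
    print("all owned:", total_cost(load, max(load), 1.0, 2.0))
```

With these made-up rates, owning the base load and renting the peak is cheaper than either extreme; different rates or a flatter profile can change that answer.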
Utility Source Mix [Chart: generation by hour, 12:00 AM to 11:00 PM, vertical axis 0 to 180. Regions: Variable Load, Base Load. Sources: Other, Customer Solar, Customer Wind, Utility Solar, Utility Battery, Utility Wind, Nuclear, Fossil] 16
Questions? Yes, we're hiring: svds.com 17
THANK YOU John @BigDataAnalysis Stephen @steveos Slides are here: http://www.svds.com/downloads 18