Building Highly Available Services on the Windows Azure Platform
Pooja Singh, Technical Architect, Accenture
Aakash Sharma, Technical Lead, Accenture
Laxmikant Bhole, Senior Architect, Accenture
Assumptions: you know the basics of Web/Worker roles, SQL Azure, Windows Azure Storage, and Windows Azure Diagnostics.
Session Agenda
Topics: understand availability; causes of unavailability; what you get with Azure; what you do on your own; guiding principles
Audience: developers and architects; people who need highly available services
Takeaway: Windows Azure's inherent attributes for building highly available services; architectural expectations for building highly available services
How do you define availability? What is acceptable downtime? What happens in case of failure? Is all functionality required to be available, or only degraded functionality? Fail-safe behavior; acceptable performance.
Cost of building highly available services: unavailability vs. high availability. [Chart: cost and complexity plotted against availability]
INDIA 28-30 September 2011
Implementation costs for a new project: implementation cost for a startup company that offers its software as a service with a hosting company. [Chart: Traditional vs. Azure]
Causes of unavailability: increase in workload (non-scalable architecture, poor performance); platform failures; upgrades; failure (hardware, network, transient conditions).
What you get with Azure
Azure to the rescue
Azure monthly service level agreement
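To make a monthly SLA percentage concrete, it helps to convert it into a downtime budget. The following is a minimal sketch in Python; the percentages used (99.9% and 99.95%) and the 30-day month are illustrative assumptions, not figures taken from the slide, and `allowed_downtime_minutes` is a hypothetical helper.

```python
def allowed_downtime_minutes(availability_percent: float, days_in_month: int = 30) -> float:
    """Return the downtime budget, in minutes, for one month at the given availability."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative percentages only (assumed, not quoted from an SLA document).
    for pct in (99.9, 99.95):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes/month")
```

Under these assumptions, each extra "nine" shrinks the monthly budget sharply, which is why the acceptable downtime should be agreed before the architecture is chosen.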
Azure out-of-the-box features: elasticity (scale compute resources up/down on demand); self-service management; self-recovery for nodes; fault domains; storage resilience (3 copies of storage, geo-replication); built-in network redundancy.
What you need to do
Design for Increased Load
Is this scalable? [Diagram: a load balancer in front of Web Role instances 1 to 4, all backed by SQL Azure]
Is this scalable? [Diagram: a load balancer in front of Web Role instances 1 and 2; a queue; Worker Role instances 1 and 2; SQL Azure, Table storage, and Blob storage]
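The queue in this second design is what lets web and worker roles scale independently. Here is a minimal in-process sketch in Python, using the standard `queue` and `threading` modules as stand-ins for a Windows Azure queue and worker role instances; the function names and the message format are hypothetical.

```python
import queue
import threading


def web_role_enqueue(work_queue: queue.Queue, order_id: int) -> None:
    """Stand-in for a web role: accept the request and hand the work off."""
    work_queue.put({"order_id": order_id})


def worker_role(work_queue: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Stand-in for a worker role instance: process messages until a None sentinel."""
    while True:
        message = work_queue.get()
        if message is None:
            break
        with lock:
            results.append(message["order_id"] * 2)  # placeholder for real processing


def run_demo(num_workers: int = 2, num_messages: int = 10) -> list:
    """Start workers, enqueue messages, then shut the workers down cleanly."""
    work_queue: queue.Queue = queue.Queue()
    results: list = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker_role, args=(work_queue, results, lock))
        for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for order_id in range(num_messages):
        web_role_enqueue(work_queue, order_id)
    for _ in workers:
        work_queue.put(None)  # one sentinel per worker
    for worker in workers:
        worker.join()
    return sorted(results)
```

Changing `num_workers` does not touch the producer code, which is the property the slide is pointing at.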
Design for scalability: use loosely coupled nodes; design for redundancy; scale OUT everything (better to have fifty 1 GB databases than one 50 GB database); test at scale.
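Scaling out the data tier usually means routing each key to one of many small databases. The sketch below shows one way to do that in Python, assuming hash-based routing on a customer ID; the `shard_for` function and the `customers_db_NN` naming scheme are hypothetical, and this is not a SQL Azure API.

```python
import hashlib


def shard_for(customer_id: str, shard_count: int = 50) -> str:
    """Map a customer ID to one of `shard_count` databases using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % shard_count
    return f"customers_db_{index:02d}"
```

A fixed modulo keeps the sketch short, but changing `shard_count` later remaps most keys, so a real design would need a plan for rebalancing.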
Design for performance: keep service and data closer to the user (same data center to avoid network latency; CDN); caching; be mindful of throughput and transaction thresholds; auto-scaling.
How the CDN works: content is kept closer to end users; 24 physical nodes globally; works for web apps and public blobs. [Diagram: Blob A in Azure Storage; CDN Region A serves users in Europe from a copy of Blob A; CDN Region B serves users in Asia from a copy of Blob A]
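The CDN picture can be read as a pull-through cache: an edge node serves its own copy of a blob if it has one and otherwise fetches it from storage. This is a minimal sketch in Python; the `EdgeNode` class, the dictionary standing in for Azure Storage, and the absence of cache expiry are simplifying assumptions, not the Windows Azure CDN API.

```python
class EdgeNode:
    """Hypothetical CDN edge node that caches blobs fetched from an origin store."""

    def __init__(self, region: str, origin: dict):
        self.region = region
        self.origin = origin      # stand-in for Azure Storage
        self.cache: dict = {}
        self.origin_fetches = 0   # how many times this node went back to the origin

    def get(self, blob_name: str) -> bytes:
        if blob_name not in self.cache:
            # Cache miss: pull the blob from the origin once and keep a copy.
            self.cache[blob_name] = self.origin[blob_name]
            self.origin_fetches += 1
        return self.cache[blob_name]


origin_storage = {"blob-a": b"hello"}
europe = EdgeNode("CDN Region A", origin_storage)
asia = EdgeNode("CDN Region B", origin_storage)
```

Repeated calls to `europe.get("blob-a")` reach the origin only on the first request; later requests are served from the regional copy.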
Decide Upgrade Strategies
Upgrade strategies: VIP swap; new service and DNS swap; upgrade domains.
How does an upgrade domain work? [Diagram: DNS (Myservice.Cloudapp.net) and a load balancer in front of Myservice instances, each shown moving from v1 to v2]
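The idea can be simulated as an in-place upgrade that walks the upgrade domains one at a time, so that instances in the other domains keep serving traffic. The following is a minimal sketch in Python; the instance names, the three-domain layout, and the `rolling_upgrade` function are hypothetical.

```python
def rolling_upgrade(instances: dict, new_version: str):
    """Upgrade one domain at a time and yield a version snapshot after each step.

    `instances` maps an instance name to a dict with "domain" and "version" keys.
    """
    domains = sorted({info["domain"] for info in instances.values()})
    for domain in domains:
        for info in instances.values():
            if info["domain"] == domain:
                info["version"] = new_version
        yield domain, {name: info["version"] for name, info in instances.items()}


service = {
    "Myservice_0": {"domain": 0, "version": "v1"},
    "Myservice_1": {"domain": 1, "version": "v1"},
    "Myservice_2": {"domain": 2, "version": "v1"},
}
steps = list(rolling_upgrade(service, "v2"))
```

Each entry in `steps` shows a mix of v1 and v2 until the last domain is done, which is why both versions must be able to run side by side during the upgrade.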
Handle Failure
Fault tolerance
Self-recovery: can your service fix itself?
Transaction and recovery: loosely coupled; transaction rollback and recovery
Network failures: retry logic
What is retry logic?
When: network failure or transient conditions; the service is temporarily unavailable. E.g. SQL Azure error 40501: "The service is currently busy. Retry request after 10 seconds."
What: retry for any external connection: SQL Azure, Windows Azure Storage, Service Bus, any external service
How: use the RetryPolicy class or the Transient Fault Handling Framework (NoRetry, Retry, RetryExponential)
Retry Code Example
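A minimal sketch of retry with exponential backoff, written in Python rather than with the RetryPolicy class or the Transient Fault Handling Framework. The `TransientError` exception, the `retry_exponential` function, and its default delays are hypothetical names and values chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_exponential(operation, max_attempts=4, base_delay=0.5, max_delay=10.0,
                      sleep=time.sleep):
    """Call `operation`; on TransientError wait base_delay * 2**attempt and retry.

    The `sleep` parameter is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(min(base_delay * (2 ** attempt), max_delay))
```

Only errors known to be transient are retried here; anything else propagates immediately, which keeps permanent failures from being masked by the retry loop.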
Disaster Recovery
Disaster recovery: backups; fault domains; geo-replication; Traffic Manager (performance, round robin, failover).
How Traffic Manager works. Policies: performance (use for geo-distributed services); round robin; failover. Monitoring. [Diagram labels: large user base, small user base]
Traffic Manager performance policy: decides which data center to connect to. [Diagram: DNS name myservice.com goes through Traffic Manager (policies, monitoring) to Myservice in the East Asia DC (Myservice-ea.cloudapp.net), the North Europe DC (Myservice-ne.cloudapp.net), or the North Central U.S. DC (Myservice-ncus.cloudapp.net)]
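The three policies differ only in how they choose among healthy endpoints. The sketch below illustrates that choice in Python using the host names from the diagram; the function names and the latency figures in the usage note are made up for illustration, and this is not how Traffic Manager is configured or implemented.

```python
ENDPOINTS = [
    "Myservice-ea.cloudapp.net",
    "Myservice-ne.cloudapp.net",
    "Myservice-ncus.cloudapp.net",
]


def performance(latency_ms: dict, healthy: set) -> str:
    """Pick the healthy endpoint with the lowest latency for this caller."""
    return min((e for e in ENDPOINTS if e in healthy), key=lambda e: latency_ms[e])


def failover(healthy: set) -> str:
    """Pick the first healthy endpoint in priority (list) order."""
    return next(e for e in ENDPOINTS if e in healthy)


def make_round_robin():
    """Return a picker that rotates across the healthy endpoints."""
    position = 0

    def pick(healthy: set) -> str:
        nonlocal position
        candidates = [e for e in ENDPOINTS if e in healthy]
        choice = candidates[position % len(candidates)]
        position += 1
        return choice

    return pick
```

For example, with assumed latencies of 250 ms, 40 ms, and 120 ms for the three endpoints, `performance` returns the North Europe host; if monitoring marks it unhealthy, the same call falls back to North Central U.S.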
Load Test, Diagnostics & Monitoring
Load test your service: Visual Studio 2010 Ultimate load tests
Diagnostics: Windows Azure Diagnostics; Service Management APIs; Storage Management APIs; CSS SQL Azure Diagnostics
Monitoring: Visual Studio profiling tools; Windows Azure Management Pack for SCOM
Guiding Principles
Guiding Principles: use loosely coupled roles (queues promote loose coupling); handle fault tolerance (recover from faults); handle scalability in the architecture (design for scalability); run multiple instances of each role (availability in case of role failure).
Guiding Principles: design and code for instance failure; build in redundancy; monitor everything; use feedback to recover fast; load test; fail fast.
Thanks. Please give your feedback by completing the evaluation at the end of this session. You can also write to me at Laxmikant.Bhole@accenture.com