My experience writing a DR service for CloudStack
Alena Prokharchyk, Citrix, @Lemonjet
What is a disaster for the cloud?
- A disaster for the cloud is a hardware/software failure, a network/power outage, or physical damage to the data center (DC)
- A disaster can cause partial or entire DC failure
- As a result, VMs become unresponsive and need to be restored in another data center
- The goal of a DR (disaster recovery) product is to prepare VMs for failover and recover them in a short time frame
Existing DR solutions in CS
- Recurring snapshots feature!
- No out-of-the-box cross-zone recovery solution
What the new DR service does
- Lets the admin configure the recovery service without adding extra scripts and config files
- Prepares for disaster and restores the VM and all its metadata (networks, networking rules)
- Recovers VMs across zones
- Updates the recovery VMs' metadata in real time, which helps keep MTTR (Mean Time to Repair) low
- Provides a tiered DR service: the most important apps/accounts can be recovered first
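The tiered recovery idea can be illustrated with a minimal sketch. The `ProtectedVm` class, the `tier` field, and the convention that tier 1 is the most important are assumptions made for illustration; the talk does not say how tiers are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedVm:
    name: str
    tier: int  # hypothetical convention: 1 = most important, recovered first


def recovery_order(vms):
    """Return VMs sorted so that lower tier numbers are recovered first.

    sorted() is stable, so VMs within the same tier keep their input order.
    """
    return sorted(vms, key=lambda vm: vm.tier)


vms = [
    ProtectedVm("batch-01", 3),
    ProtectedVm("db-01", 1),
    ProtectedVm("web-01", 2),
    ProtectedVm("db-02", 1),
]
ordered = [vm.name for vm in recovery_order(vms)]
```

Here `ordered` lists the tier 1 VMs first, then tier 2, then tier 3.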
Things the DR service doesn't cover
- The DR service does no storage replication, only metadata replication
- Storage replication is handled by the admin outside of CS (NetApp's SnapMirror)
Which versions of CloudStack does DR support?
DR works with:
- CloudStack 4.5
- The next Citrix CloudPlatform release, based on ASF 4.4
Design principles followed while writing the DR service
- Develop as a CS plugin in V1, with the ability to run as a separate service in future versions
- No changes to core/server CS code that are specific to DR only
- No direct access to the CS DB; all data manipulation goes through CS APIs only
- The DR service doesn't have its own DB in version 1; all DR data is stored in the CS DB in the form of resource metadata
- Rely on MTBF (Mean Time Between Failures) being high. Never fail a VM in the original zone if its preparation fails; let the admin fix things and retry
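The last two principles can be sketched together: DR data lives as key/value details attached to the VM, and a failed preparation only records an alert without touching the original VM. This is a minimal in-memory sketch, not the plugin's code. The detail names `DR_ALERT` and `DR_RECOVERY_ID` appear on the metadata slide; the function names, the dict layout, and the alert string format are assumptions.

```python
class PreparationError(Exception):
    """Raised when the recovery VM cannot be prepared (hypothetical)."""


def prepare_for_failover(vm, details, deploy_recovery_vm):
    """Try to prepare a VM for failover; never modify the VM itself.

    vm: dict with "id" and "state" (illustrative layout).
    details: dict mapping vm id -> {detail_name: detail_value}.
    deploy_recovery_vm: callable returning the recovery VM id,
        or raising PreparationError.
    """
    vm_details = details.setdefault(vm["id"], {})
    try:
        recovery_id = deploy_recovery_vm(vm)
    except PreparationError as err:
        # Record the problem for the admin; the original VM keeps running.
        vm_details["DR_ALERT"] = f"FAILED_TO_PREPARE_FOR_DR: {err}"
        return False
    vm_details["DR_RECOVERY_ID"] = str(recovery_id)
    vm_details.pop("DR_ALERT", None)  # a successful retry clears the alert
    return True


def failing_deploy(vm):
    raise PreparationError("Failed to attach Nic to the Recovery vm")


def working_deploy(vm):
    return 2  # id of the recovery VM (illustrative)


vm = {"id": 1, "state": "Running"}
details = {}
first_try = prepare_for_failover(vm, details, failing_deploy)
alert_after_first_try = details[1].get("DR_ALERT")
second_try = prepare_for_failover(vm, details, working_deploy)
```

After the failed first attempt the VM is still `Running` and only an alert is stored; the retry replaces the alert with the recovery VM id.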
DR Service deployment
[Diagram. Labels: DR service, CloudStack, DR UI plugin, DR API plugin, DR Server, DR Events listener, CS UI, CS API, CS Orchestration engine, CS Services/Plugins, DR Events, Event listener, message bus]
DR process
- Configuration: configuring the DR service
- Preparation: preparing the VM for failover
- Failover: failing over the VM to the Recovery zone
- Failback: failing back the VM to its Original zone
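One way to picture the per-VM part of this process is as a small state machine. The sketch below is an assumption-laden illustration: the state names borrow from the "Ready to Failover / FailedOver" states mentioned later in the talk, `UNPROTECTED` is invented, and the choice that a failed-back VM returns to `READY_TO_FAILOVER` is a guess, not something the slides state.

```python
from enum import Enum


class DrState(Enum):
    UNPROTECTED = "Unprotected"            # hypothetical initial state
    READY_TO_FAILOVER = "ReadyToFailover"
    FAILED_OVER = "FailedOver"


# Allowed (step, current state) -> next state. Assumed, not from the talk.
TRANSITIONS = {
    ("prepare", DrState.UNPROTECTED): DrState.READY_TO_FAILOVER,
    ("failover", DrState.READY_TO_FAILOVER): DrState.FAILED_OVER,
    ("failback", DrState.FAILED_OVER): DrState.READY_TO_FAILOVER,
}


def apply_step(state, step):
    """Return the next DR state, or raise ValueError for an invalid step."""
    try:
        return TRANSITIONS[(step, state)]
    except KeyError:
        raise ValueError(
            f"step {step!r} is not allowed in state {state.value}"
        ) from None


state = DrState.UNPROTECTED
for step in ("prepare", "failover", "failback"):
    state = apply_step(state, step)
```

Rejecting invalid steps, such as a failover for a VM that was never prepared, keeps the admin from triggering recovery for a VM that has no recovery metadata.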
Configuration: DR setup
- Set up the Active zone with the Recovery zone
- Configure DR offerings (SLAs)
- Tag storages for the placement of the DR VMs' volumes
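A minimal sketch of the zone pairing and storage tagging steps, assuming an in-memory model. The class names, the `"DR"` tag value, and the rule that the two zones must differ are assumptions for illustration; DR offerings (SLAs) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePool:
    name: str
    zone: str
    tags: frozenset


@dataclass(frozen=True)
class DrSetup:
    active_zone: str
    recovery_zone: str
    storage_tag: str = "DR"  # hypothetical tag marking DR-capable storage

    def __post_init__(self):
        # Assumed rule: recovery is cross-zone, so the zones must differ.
        if self.active_zone == self.recovery_zone:
            raise ValueError("the Recovery zone must differ from the Active zone")

    def eligible_pools(self, pools):
        """Pools in the Recovery zone that carry the DR storage tag."""
        return [
            p for p in pools
            if p.zone == self.recovery_zone and self.storage_tag in p.tags
        ]


pools = [
    StoragePool("pool-a", "zone1", frozenset({"DR"})),
    StoragePool("pool-b", "zone2", frozenset({"DR", "ssd"})),
    StoragePool("pool-c", "zone2", frozenset({"ssd"})),
]
setup = DrSetup(active_zone="zone1", recovery_zone="zone2")
eligible = [p.name for p in setup.eligible_pools(pools)]
```

Only `pool-b` qualifies here: it is in the Recovery zone and carries the tag.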
Preparing VM for failover
- The DR service listens to events from CS and deploys/updates the recovery VM's metadata in the Recovery zone
- The recovery VM doesn't occupy physical resources on the CS side
- The recovery VM is invisible to the end user
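The event-driven preparation can be sketched as a listener that mirrors metadata changes into a recovery-side record. Everything here is illustrative: the event dict layout, the event type strings, and the `RecoveryMetadataMirror` class are assumptions, and the real service works through CS APIs rather than an in-memory dict.

```python
class RecoveryMetadataMirror:
    """Keeps recovery VM metadata in step with events from the active zone."""

    def __init__(self):
        # active VM id -> metadata of its recovery VM (metadata only,
        # no physical resources are modeled)
        self.recovery_vms = {}

    def on_event(self, event):
        kind, vm_id = event["type"], event["vm_id"]
        if kind == "VM.CREATE":
            self.recovery_vms.setdefault(vm_id, {"nics": set()})
        elif kind == "NIC.CREATE" and vm_id in self.recovery_vms:
            self.recovery_vms[vm_id]["nics"].add(event["network"])
        elif kind == "NIC.DELETE" and vm_id in self.recovery_vms:
            self.recovery_vms[vm_id]["nics"].discard(event["network"])
        elif kind == "VM.DESTROY":
            self.recovery_vms.pop(vm_id, None)
        # Any other event type is ignored in this sketch.


mirror = RecoveryMetadataMirror()
for event in [
    {"type": "VM.CREATE", "vm_id": 1},
    {"type": "NIC.CREATE", "vm_id": 1, "network": "net-1"},
    {"type": "NIC.CREATE", "vm_id": 1, "network": "net-2"},
    {"type": "NIC.DELETE", "vm_id": 1, "network": "net-1"},
]:
    mirror.on_event(event)
```

Because each event is applied as it arrives, the recovery metadata is already current when a failover is requested, which is what keeps MTTR low.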
[Diagram: Preparing VM for failover. The DR Service sits between the Active zone and the Recovery zone; each zone shows a UserVm with Nic1 and Nic2.]
Failover process
- The process of restoring a failed VM in the recovery zone
- DR doesn't automatically detect that a disaster has happened
- The DR admin triggers failover for the VM by calling the DR API
- The DR service performs the failover process
[Diagram: Failover process. Active zone: UserVm (UUID1) with Volume1 and Volume2 on CS storage1, backed by Physical storage1. Recovery zone: UserVm (UUID1) with Volume1 and Volume2 on CS storage2, backed by Physical storage2. Other labels: DR Service, NetApp SnapMirror.]
Failback process
- The process of moving the VM back to its original zone
- VM metadata is preserved in the original zone and reused when the VM is recovered
- The recovery VM's volumes are reintroduced to the original zone and attached to the original VM
- The VM in the recovery zone gets disabled
- The VM in the original zone gets enabled
- A UUID swap happens
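The enable/disable and UUID swap steps can be sketched as follows. The `VmRecord` class and the UUID strings are hypothetical, and volume reattachment is omitted; the assumption is that the swap keeps the user-facing UUID on whichever VM record is currently enabled.

```python
from dataclasses import dataclass


@dataclass
class VmRecord:
    zone: str
    uuid: str
    enabled: bool


def failback(original, recovery):
    """Hand the workload back to the original zone (metadata only)."""
    recovery.enabled = False
    original.enabled = True
    # Swap UUIDs so the user-facing UUID follows the enabled VM.
    original.uuid, recovery.uuid = recovery.uuid, original.uuid


# State after a failover: the recovery VM is active under the user-facing UUID.
original = VmRecord(zone="original", uuid="internal-uuid", enabled=False)
recovery = VmRecord(zone="recovery", uuid="UUID1", enabled=True)
failback(original, recovery)
```

After the call, the original VM is enabled and carries `UUID1` again, so the end user keeps addressing the VM by the same identifier.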
DR metadata in CS DB

user_vm
id  name      zone_id
1   VM-user1  1
2   VM-user1  2

user_vm_details
vm_id  detail_name     detail_value
1      DR_RECOVERY_ID  2
1      DR_STATE        1
       DR_ALERT        FAILED_TO_PREPARE_FOR_DR Failed to attach Nic to the Recovery vm
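To show how the `DR_RECOVERY_ID` detail links a VM to its recovery VM, here is a minimal sketch over in-memory copies of the rows from the slide. The helper name `find_recovery_vm` is hypothetical, and the real service reads this data through CS APIs, not from the DB or from Python lists.

```python
user_vm = [
    {"id": 1, "name": "VM-user1", "zone_id": 1},
    {"id": 2, "name": "VM-user1", "zone_id": 2},
]
user_vm_details = [
    {"vm_id": 1, "detail_name": "DR_RECOVERY_ID", "detail_value": "2"},
    {"vm_id": 1, "detail_name": "DR_STATE", "detail_value": "1"},
]


def find_recovery_vm(vm_id):
    """Follow the DR_RECOVERY_ID detail from a VM to its recovery VM row."""
    for row in user_vm_details:
        if row["vm_id"] == vm_id and row["detail_name"] == "DR_RECOVERY_ID":
            recovery_id = int(row["detail_value"])
            return next((vm for vm in user_vm if vm["id"] == recovery_id), None)
    return None  # the VM has no recovery VM recorded


recovery = find_recovery_vm(1)
```

For VM 1 this returns the row with id 2 in zone 2; a VM without a `DR_RECOVERY_ID` detail yields `None`.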
Who controls the DR process
- The admin controls the recovery process on behalf of users' VMs
- The end user can monitor:
  - The DR state of their VMs: Ready to Failover / FailedOver
  - Recovery zone info: the zone to which the VM recovers in case of failure
  - Recovery public IP address(es) info, needed to reconfigure their public DNS
CS API enhancements
- Added some missing data to CS API responses
- Added missing resource_details tables for some CS resources
- Added support for CS services to publish alerts via CS APIs
- Introduced external UUID management
- Implemented resource creation with delayed start for some objects (VPC)
Things yet to fix in CS
- Single sign-on is missing
- Resource creation in the DB and the actual implementation are not granular enough
Summary
If you are an API developer for an open source IaaS product:
- Always think from an end user/customer use-case perspective when adding or modifying end-user APIs
- Look at what plugins, services, and bug fixes people write for your software; this helps identify missing pieces and common problems in your software