GoGrid Implement.com

Configuring a SQL Server 2012 AlwaysOn Cluster

Overview

This document describes the SQL Server 2012 disaster recovery design and deployment, calling out best practices and concerns from the point of view of high availability, disaster recovery, and client connections. The design has been revised to take into consideration lessons learned during the GoGrid deployment and testing. The design assumes Windows Server 2012 for hosting all SQL Server instances and uses SQL Server 2012 AlwaysOn Availability Groups for high availability and disaster recovery. Another important consideration, not covered in this document, is whether to also have a Windows failover cluster instance with shared storage failing over at the server level, in addition to the availability groups failing over at the database level.

Physical or Virtual Hardware Configuration

1. One Active Directory domain controller per datacenter.
2. A single domain with a private network between servers in the domain.
3. An application server with, at a minimum, the SQL Server 2012 Management Studio and SQLCMD client tools installed.
o Any SQL Server node can also be a client. The concern is the resource load that SQL Server Management Studio (or whatever client) places on that SQL Server. You don't want other processes competing with SQL Server for resources, and you definitely don't want SQL Server to page. Run the client from the best place, but know the consequences.
4. Windows Server 2012 servers that will host HA and/or DR instances of SQL Server in the primary datacenter. These servers are cluster nodes without shared storage.
5. Select a primary read-write server for the databases in the availability group (the primary replica). Select a server for a synchronous (HA) secondary replica. Any remaining secondary replicas will be asynchronous DR replicas.
6. Windows Server 2012 servers that will host SQL Server instances in the secondary datacenter.
These are both asynchronous DR secondary replicas in the availability group.
7. Bring the servers hosting SQL Server up to date with all hotfixes and patches using Windows Update.
8. Servers in a second GoGrid datacenter have different subnets. CloudLink can be used to link the private subnets together. Make sure to assign both public and private IPs to the servers. If necessary, configure routers and firewalls between networks. Servers on one network should be able to access servers on the other network, and the client applications should be able to log on to any SQL Server on any network. Make sure all Windows Cluster and SQL Server ports are open across firewalls and routers. SQL Server requires port 1433 for client sessions and port 5022 for mirroring of availability group databases. You will need to create static routes to send private traffic through CloudLink.

GoGrid SQL Server Business Continuity Design and Deployment

9. There is a single SQL Server instance per Windows Server instance.
10. Verify that enough resource is available for hosting SQL Server. A server with 4 CPUs and 16GB RAM is a good starting point.
11. Configure partitions on separate spindles, if possible, for the system, the SQL Server data files, and the SQL Server log files. SQL Server installation directories are on the system partition.
12. The SQL Server data and log partitions need enough storage to accommodate the database files. This can be anywhere from a few gigabytes to terabytes. A test configuration might start with 100GB on each of the data and log file partitions, to be divided among databases.
13. Storage for the data file partition will generally see better performance when optimized for reads; log file partitions, for writes. Both partitions can be optimized for random file access. Any sequential file access pattern isn't relevant once there are multiple database files sharing the partitions.
14. Consider requirements for storage IOPS. Less-than-optimal storage performance can cause overall SQL Server performance to decline rapidly, especially for high transaction volumes that require high-performance write partitions.
15. The tempdb database files should be on the data and log partitions, not the system partition.
16. Do not use any write-behind caching. All SQL Server writes should write through to disk.
17. SQL Server 2012 supports SMB 3.0 network file share locations for database files.
18. If networks are not using IPv6 addresses, disable IPv6 on each adapter. Use static IPv4 addresses.
19.
Add the IP addresses of the domain controllers in each datacenter as DNS server addresses on every node, with the DNS server on the same subnet as the node as the first address and the remote DNS server as the second address.

Cluster Services

The following steps prepare each node for creating a SQL Server AlwaysOn availability group. Each node has independent storage; there are no storage dependencies between nodes.

20. The default configuration is that all cluster nodes in all datacenters have a quorum vote. If a design is intended to prevent remote datacenter servers from having a vote, a hotfix is required in Windows Server 2008 R2 before installing cluster services. Windows Server 2012 natively allows the quorum configuration to deny nodes a vote.
21. Each server provisioned from a template that becomes a cluster node or a domain controller must go through SYSPREP /generalize before the node is added to a cluster instance.
22. Each server hosting a SQL Server instance starts from a base OS installation that is up to date with hotfixes from Windows Update.
23. Each server that is an intended target of a failover should be hosted on a separate physical host without any shared storage.
24. In Server Manager, add the Failover Clustering feature on each server hosting a SQL Server instance that has a database that is an availability group replica. Cluster services require .NET 3.5.1.
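The feature installation and cluster creation above can also be scripted. A minimal PowerShell sketch, assuming hypothetical node names SQL01 through SQL03, a cluster name SQLCLUSTER, and one static IP per subnet (all names and addresses are illustrative):

```powershell
# Run on each node that will host an availability group replica:
# install the Failover Clustering feature and its management tools.
Install-WindowsFeature Failover-Clustering -IncludeManagementTools

# Run once, from the primary node: create the cluster, listing all
# participating nodes in all datacenters. -NoStorage keeps disks from
# being added as shared cluster storage, which an availability group
# cluster does not use.
New-Cluster -Name SQLCLUSTER -Node SQL01, SQL02, SQL03 `
    -StaticAddress 10.1.0.40, 10.2.0.40 -NoStorage
```

The two -StaticAddress values give the cluster access point an IP on each subnet where cluster nodes exist, as the steps below require.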
25. In Failover Cluster Manager on the primary node, create a cluster. Specify all the participating nodes in all datacenters.
26. Availability group nodes all have separate storage. Bypass any attempt to add storage as shared storage. There is no need to include storage in the cluster validation test.
27. Create a cluster access point. This is an Active Directory network name. It must have IP addresses on each subnet where cluster nodes exist. You can use one of the IPs in your block for this, but you need to make note of it and not use it for other servers or you will have an IP address conflict. Since you have a /24 for your private block, it is advisable to use IPs near the back end of the block and mark them as assigned.
28. Before configuring the quorum, if there is an even number of voting nodes in the cluster, create a file share on a server that is not part of the cluster. Use Server Manager to create the share under Roles, File Services, Share and Storage Management. The File Services role may have to be added in Server Manager. Grant the Active Directory cluster object full control network and file system permissions on the share. The cluster object is an Active Directory computer object name. Permissions must be granted to the SMB share and to the file system separately. Once the cluster is running, configure the quorum. Neither disk quorum configuration is applicable in an availability group cluster instance without shared storage.
29. Configure the quorum. If there is an odd number of voting nodes, use the Node Majority quorum. If there is an even number of voting nodes, use the Node and File Share Majority configuration. The file share can break any ties when nodes vote on the health of the cluster.
30. If the quorum configuration fails with an error that the file share cannot be found, check that the cluster group owner is a node on the same subnet as the server where the file share was created. From a command prompt, run Cluster Group.
C:\>cluster group
31. If the node that owns the cluster group is not the desired node, use the Cluster Group command to change the owning node.
C:\>cluster group /move:targetnodenetbiosname
32. You can also use PowerShell (the Move-ClusterGroup cmdlet) to move the cluster group.
33. The secondary replica in a second datacenter should not be listed as a voting member. This is because of latency, and also because it is intended as a disaster recovery node rather than an active member of the cluster.

SQL Server Instances using Availability Groups for HA and DR

The following are the steps for deployment of SQL Server instances hosting availability groups.

34. Create a single domain account for the SQL Server service on all nodes to use. All instances of the SQL Server database engine in the availability group should use the same domain account.
Granting the Lock Pages in Memory user right to the SQL Server service account on each node will reduce memory pressure and improve performance. Granting the Perform Volume Maintenance Tasks user right to the SQL Server service account on each node will substantially improve the performance of file growth events, but carries with it a slight security risk.
35. GoGrid servers can be ordered with SQL Server 2012 pre-installed. If you are installing SQL Server separately, do not select New SQL Server failover cluster installation: that option creates a failover cluster instance, which fails over at the SQL Server instance level, whereas an availability group fails over at the database level. The default instance name MSSQLSERVER should be used. Select the features desired for the installation: the Database Engine, Client Tools Connectivity, Client Tools Backwards Compatibility, and Management Tools Basic and Complete should be chosen at a minimum.
36. Specify the domain user account for the SQL Server service. Enable the SQL Server Agent service to start automatically if the SQL Server will run scheduled jobs or maintenance plans. Keep the SQL Server Browser service disabled when using the default MSSQLSERVER instance and port 1433 for client sessions.
37. On the Server Configuration page, choose at least one domain user to be a SQL Server administrator. If you will be a SQL Server administrator, be sure to Add Current User to the collection of administrators. If your policy is that Domain Admins are also SQL Server administrators, add them along with any other Windows user or group that will be able to administer SQL Server.
38. On the Data Directories tab, specify the partitions created for SQL Server data files and log files as the defaults for user databases.
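The installation choices in the steps above can also be expressed as an unattended command-line install. The following is a sketch only, assuming a hypothetical domain CONTOSO, service account sqlsvc, and D: and E: partitions for data and log files; substitute your own accounts, passwords, and paths:

```
setup.exe /QS /ACTION=Install /IACCEPTSQLSERVERLICENSETERMS ^
  /FEATURES=SQLENGINE,CONN,BC,SSMS,ADV_SSMS ^
  /INSTANCENAME=MSSQLSERVER ^
  /SQLSVCACCOUNT="CONTOSO\sqlsvc" /SQLSVCPASSWORD="<password>" ^
  /AGTSVCACCOUNT="CONTOSO\sqlsvc" /AGTSVCSTARTUPTYPE=Automatic ^
  /SQLSYSADMINACCOUNTS="CONTOSO\Domain Admins" ^
  /SQLUSERDBDIR="D:\SQLData" /SQLUSERDBLOGDIR="E:\SQLLogs"
```

The FEATURES list mirrors the minimum feature set named above (Database Engine, Client Tools Connectivity, Backwards Compatibility, and Management Tools Basic and Complete).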
The drives and directories where availability group database data files and log files are located should be the same on all nodes in order for the Create Availability Group Wizard to synchronize the databases. Otherwise, backup and restore must be performed manually to sync the primary replica with any secondary replicas where the file paths differ.
39. If a node is not a member of a failover cluster instance, join the node to the cluster instance before creating the availability group. All servers hosting availability group databases must first be nodes in a cluster. Cluster services run on all the nodes and establish connections that make the entire cluster effectively a single administrative unit. The voting resources in the cluster quorum, i.e. the nodes that have a vote and the file share, keep track of the cluster health and availability and communicate that to SQL Server. When a node fails, the cluster services on the other nodes have that information; therefore SQL Server on the primary node also has that information, and knows the failed node cannot be used for failover.
The difference between a cluster for an availability group and a cluster for a failover cluster instance is that, for an availability group, SQL Server is responsible for the database mirroring from the primary to the secondary nodes and moves the primary replica to a different node during failover. The cluster service only tracks the health of an availability group and sends and receives messages with SQL Server on the primary node. SQL Server tells the cluster service which node is the primary node.
40. When SQL Server installation is complete and the server node is a member of a cluster instance, enable SQL Server availability groups from SQL Server Configuration Manager, in the properties of the MSSQLSERVER service, on the AlwaysOn High Availability tab. If the server is not part of a cluster instance, this property setting will not be available. Restart the service. Repeat for all nodes.
41. Create at least one database on the primary node that will be part of the availability group, even if it is a temporary database used only for this purpose.
42. Create a file share, accessible from all nodes in the availability group, to store the SQL Server backups used to sync availability group replicas. The SQL Server service account for the replicas will need network and file system permissions to read and write to the file share. You can use the storage on a node for the file share. It is not advisable to use Cloud Storage for this purpose, as it is intended to operate as a NAS and not as block storage.
43. Before creating the availability group, any database that is to be a member of the availability group must be backed up at least once. Create a full database backup if a database is new. If the backup share used to sync the availability group replicas has not been used before, this will be a test of whether the SQL Server service account has the necessary permissions.
You can use Windows Backup to manually back up the file share to Cloud Storage if you so desire.
44. An availability group contains a single primary replica, to which clients connect to write to the databases. Choose which node will host the primary replica. SQL Server will automatically change the cluster group owner as necessary.
45. Choose which nodes will be secondary replicas. Of these secondary replicas, choose which will use synchronous commits with the primary replica (HA) and which will use asynchronous commits (DR). A secondary replica has a mirrored copy of all the databases in an availability group. SQL Server keeps the databases up to date through either synchronous or asynchronous writes. Synchronous writes mean a transaction does not commit on the primary replica, and notification of the write's success is not returned to the client, until the write succeeds on a synchronous secondary replica. The fastest possible connection between replicas is important so that synchronous writes do not cause noticeable delays for users. Longer transaction times also mean locks are held longer, there is a greater likelihood of blocking locks and deadlocks, and CPU usage is higher in order to manage these resources. An asynchronous write means there is some latency between the commits on the primary and secondary replicas. If a transaction commits on the primary replica but the primary fails before the transaction is written to the secondary replica, data can be lost. Any secondary replica that is online and in a "synchronizing" state on the dashboard is receiving asynchronous mirrored writes from the primary replica and can be used for manual failover. A "synchronized" state means the replica is a synchronous secondary.
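The database creation and initial backup described above can be done in Transact-SQL. A minimal sketch, assuming a hypothetical database DB1 and backup share \\FS01\SQLBackup (both names are illustrative):

```sql
-- Create a database that will become a member of the availability group.
CREATE DATABASE DB1;
GO
-- A database must use the full recovery model to join an availability group.
ALTER DATABASE DB1 SET RECOVERY FULL;
GO
-- Take the initial full backup required before the database can join the
-- availability group. Writing it to the share also verifies that the
-- SQL Server service account has permissions on the share.
BACKUP DATABASE DB1 TO DISK = N'\\FS01\SQLBackup\DB1.bak';
GO
```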
46. Automatic failover occurs only with synchronous secondary replicas. Choose a synchronous secondary replica to be in the automatic failover pair.
47. Create the availability group. Select one or more databases that will be members of the availability group. Add the secondary replicas. One can be a synchronous secondary; the others must be asynchronous. The synchronous secondary replica can be an automatic failover partner. Choose any replicas that will be read-only secondary replicas. This does not configure read-only routing; read-only routing has to be configured through Transact-SQL or PowerShell commands. Choose the backup preferences (see notes below). Create an availability group listener. The availability group listener is a single network name that automatically points to the node that is currently the primary replica. It should have an IP address on every subnet. It means a client does not have to be configured with logic such as "if server X is not available, try server Y." SQL Server will register the listener name and IP address(es) with the domain DNS and as a cluster resource. IP addresses are required for each subnet on which a cluster node resides. As with the cluster access point, you can use one of the IPs in your block, but you need to make note of it and not use it for other servers or you will have an IP address conflict. The listener will route incoming client connections to the appropriate replica. Read-write client connections always connect to the single primary replica. Read-only connections can connect to a secondary replica if read-only routing is configured and the client supplies the correct properties in the connection string (see below).
48. The secondary replica in a second datacenter is typically configured with a passive license. As such, it is designed only for failover. It is a non-voting member of the quorum, configured with asynchronous commits and manual failover.
49. Creating the listener may fail in Windows Server 2008 R2 because the Active Directory objects do not have the necessary permissions. This occurred during the lab configuration. If necessary, grant the cluster Active Directory object full control permission over the listener Active Directory object. Also grant the cluster object Create Object permission in the domain.
50. Specify full synchronization using the file share created to store the database and transaction log backups used to sync replicas. If the database data files or log files have a different path on any node, the Join Only synchronization preference must be used and the backups restored manually on each node.
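The listener can also be created in Transact-SQL rather than through the wizard. A sketch, assuming a hypothetical availability group AG1, listener name AGL1, and one static IP per subnet (all names and addresses are illustrative):

```sql
-- Add a listener with a static IP address on each subnet that hosts a
-- cluster node. Clients connect to the listener name, not to a node name,
-- so connections follow the primary replica after a failover.
ALTER AVAILABILITY GROUP AG1
ADD LISTENER N'AGL1' (
    WITH IP (
        (N'10.1.0.50', N'255.255.255.0'),   -- primary datacenter subnet
        (N'10.2.0.50', N'255.255.255.0')    -- secondary datacenter subnet
    ),
    PORT = 1433
);
```

As noted above, record the listener IPs in your block so they are not reused for other servers.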
51. The availability group validation may generate a warning message if the quorum file share is on the same subnet as the nodes. Microsoft best practice is to have the file share in a separate datacenter, on a separate network from any of the nodes.
52. Synchronize the replicas. Port 5022 is used by default for mirroring the replica databases.
53. Clients can use secondary replicas for read-only connections, if read-only routing is configured and license conditions permit. Before a client can create a read-only session, the availability group replicas must first be configured as readable secondary replicas. Clients then use the appropriate properties in a .NET connection string to establish read-only connections to a secondary replica. Read-only routing must be configured using Transact-SQL or PowerShell; there is no SQL Server Management Studio UI in SQL Server 2012 for configuring read-only routing. The following code runs on the primary replica, in this case SQL01. It sets up read-only routing to server SQL02 only. This code should be repeated for each secondary replica that is to be made a read-only secondary replica. When creating the read-only routing URL, specify port 1433 or whatever port is used for client connections. The last statement specifies the servers in the read-only routing list in order of the priority the listener should use in connecting (should the first server be offline, the listener tries the second one, and so on).
-- :Connect SQL01
-- use the above statement when connecting using SQLCMD
USE master
GO
ALTER AVAILABILITY GROUP AG1
MODIFY REPLICA ON N'SQL02' WITH
    (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));
ALTER AVAILABILITY GROUP AG1
MODIFY REPLICA ON N'SQL02' WITH
    (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL02.Contoso.Local:1433'));
ALTER AVAILABILITY GROUP AG1
MODIFY REPLICA ON N'SQL01' WITH
    (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SQL02')));
54. If read-only routing is configured for an availability group (it is optional), then to access a read-only secondary replica a client must log on to the availability group listener with the Database property of an availability group database in the connection string and the property ApplicationIntent=ReadOnly. In SQLCMD, with a listener network name of group1, a Windows authenticated connection, and a connection to database DB1, the command line for connecting to a read-only secondary replica would be:
C:\>sqlcmd -S group1 -E -d DB1 -K ReadOnly
55. Backups from secondary replicas are allowed, with limitations. Full database backups from secondary replicas are copy-only backups. Copy-only backups do not affect the log chain or reset differential backups. The only type of transaction log backup that can be restored after a copy-only full backup is restored is a backup taken with the NORECOVERY option, which requires that no users be working in the database. A full database backup that is intended to begin a restorable backup set must be executed from the primary replica. Transaction log backups can be taken from the primary replica or from secondary replicas, as long as the secondary replica is in a synchronized (synchronous commits) or synchronizing (asynchronous commits) state. Differential backups cannot be performed from secondary replicas. To configure backups from secondary replicas, each node in the availability group is assigned a backup priority. All this does is set the backup preferences; it has nothing to do with how backups actually run. Backup jobs must be configured on each server, and each job runs as scheduled. If the replica on which the job runs is not the preferred replica, the job ends without running the backup; only on the preferred replica does the job run. Database maintenance plans can be created on every SQL Server in an availability group that determine whether the target SQL Server is the preferred replica and that perform copy-only full database backups. Because of the problematic nature of restoring copy-only full database backups, full database backups that are intended as part of a backup set should be performed from the primary replica.
56. To restore a copy-only backup and transaction log backups to bring a database up to date with the most recently committed transaction, use an online restore sequence with the BACKUP LOG ... WITH NORECOVERY option.
57. Also see the white paper AlwaysOn Architecture Guide: Building a High Availability and Disaster Recovery Solution using AlwaysOn Availability Groups.

Client Sessions

Client sessions should use these .NET connection string properties.

58. Always specify the Database=databaseName property in the connection string.
59. Specify the MultiSubnetFailover=True connection string property when connecting to a listener with IP addresses on multiple subnets.
60. Connect to the listener using both the database name and the ApplicationIntent=ReadOnly property to have the client session automatically routed to a read-only secondary replica when the availability group has been configured with read-only routing.
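Putting these properties together, a .NET connection string for a read-only session through the listener might look like the following. The listener name group1 and database DB1 follow the earlier examples; treat the exact string as a sketch to adapt:

```
Server=tcp:group1,1433;Database=DB1;Integrated Security=SSPI;MultiSubnetFailover=True;ApplicationIntent=ReadOnly
```

Omit ApplicationIntent=ReadOnly for a read-write session; the listener then routes the connection to the primary replica.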
Recovery after Failure

61. If the primary node fails and a secondary node is in synchronous commit mode with automatic failover, open the availability group dashboard on that node for a big-picture view of the availability group. You can use SQL Server Management Studio from any client on the network and log on to the node in Object Explorer.
62. If there is no automatic failover secondary replica, fail over manually using the Availability Group Failover Wizard, launched from the availability group dashboard in SQL Server Management Studio. This is possible as long as a cluster quorum exists and the cluster is still up.
63. If the cluster has failed but cluster services are running on a remaining target node, see whether the availability group will fail over to that node by running the Availability Group Failover Wizard.
64. If the Availability Group Failover Wizard failover fails because of a cluster error, start cluster services with the force quorum switch on that node.
o Log on to Windows on that node and open a command prompt.
o From the command prompt, check the running services with the net start command.
C:\>net start
o If the cluster service is running, stop it.
C:\>net stop clussvc
o Start the cluster service with the force quorum switch.
C:\>net start clussvc /fq
o After the cluster service starts, fail over the availability group to this node using the Availability Group Failover Wizard. When the wizard completes, you should see this node as the primary node.
65. After the other nodes in the cluster are back online, if the cluster service on the primary node was started using the force quorum switch, stop and restart cluster services on that node without the switch.
C:\>net stop clussvc
C:\>net start clussvc
66. Other than the one instance where you are starting the cluster service with the force quorum switch and then restarting it in normal mode, you should not use Failover Cluster Manager or PowerShell to set cluster properties when an availability group is the cluster resource. Always interact with the availability group using SQL Server Management Studio, and let SQL Server interact with cluster services.

Manual Failover

67. Before a manual failover, first set the current primary node and the secondary node that will become the new primary node to synchronous commit mode with automatic failover. This means designating a new synchronous commit and automatic failover partner if a different node currently has these properties set, as the primary replica and only one secondary replica can be synchronous commit partners. In this case the current automatic failover and synchronous commit secondary replica would have to become an asynchronous secondary replica with manual failover. Putting the old primary and new primary replicas into synchronous commit mode ensures that no transactions committed on the old primary are lost during the manual failover because they didn't reach the new primary in time. You can set the synchronous commit and automatic failover partner back to a different secondary node after the failover is complete.
68. Run the Availability Group Failover Wizard to complete the failover, with a client connection through SQL Server Management Studio to the current primary node.
69. After the failover completes, run the dashboard from the new primary node to check the status of the availability group.
70. Change the properties of the availability group on the new primary node as necessary for the current synchronous/automatic secondary to become manual/asynchronous and for a new synchronous/automatic secondary to be designated.
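The manual failover steps can also be performed in Transact-SQL instead of the wizard. A sketch, reusing the hypothetical names AG1, SQL01 (the current primary), and SQL02 (the intended new primary):

```sql
-- On the current primary (SQL01): make the failover target a synchronous
-- commit partner so no committed transactions can be lost in the failover.
ALTER AVAILABILITY GROUP AG1
MODIFY REPLICA ON N'SQL02' WITH
    (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- On the failover target (SQL02), once the dashboard shows it in a
-- synchronized state: perform the planned manual failover.
ALTER AVAILABILITY GROUP AG1 FAILOVER;
```

After the failover, the replica availability modes can be adjusted back, as described in the last step above.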