Tuesday, April 13, 2010

Exchange 2010 Site Disaster Recovery on a dime! Part 2: Navigating the Failover Process

In Part 1 of this series I explained how to build a low-cost site or datacenter disaster recovery solution using Microsoft Exchange’s new DAG feature. In this article, I will explain the manual steps required to fail over to your other site in the event of a disaster.

First of all, let’s discuss what types of problems can occur. They range from a simple disk failure to a tornado smashing the datacenter in the primary site. In this article, I will address how to manually activate your backup Exchange server if your primary server’s motherboard or disk fails. Next, I will outline the steps to take if you experience the dreaded total site failure, and I will conclude with how to fail back to your primary site when everything returns to normal.

OK, so how do we recover from, for example, a motherboard failure?
If you find yourself in this situation, your primary Exchange server will be offline and non-functional. The good news is that all your other core infrastructure will still be up and working, including critical items like your domain controllers and DNS servers.

The first thing you will notice is that your Outlook clients will still try to connect to the original MAPI endpoint (the RPC Client Access service on the CAS). To rectify this quickly, change the A record in DNS for the ClientAccessArray to point to the IP of the CAS in the DR site. The Time To Live on this record should be only a couple of minutes, making the change to a new IP as fast as possible. Also consider the time it takes for DNS replication/updates to propagate throughout the network.
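If your DNS runs on Windows, the record swap can be scripted with dnscmd. A rough sketch, assuming a DNS server named dc01.example.local, a zone named example.local, a CAS array record named outlook, a 300-second TTL, and a DR CAS at 10.1.1.10 (all hypothetical values — substitute your own):

dnscmd dc01.example.local /RecordDelete example.local outlook A /f

dnscmd dc01.example.local /RecordAdd example.local outlook 300 A 10.1.1.10

Deleting and re-adding the record this way avoids any clicking around in the DNS console during an already stressful situation.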

Next it will be time to get the databases up and running on your DR server.

First, verify that all Exchange services are running on the DR server. If any services have been stopped, this could cause problems with transaction log replication.
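A quick way to check this from the Exchange Management Shell is Test-ServiceHealth, which reports any required services that are not running for each server role:

Test-ServiceHealth -Server FQDNofaServerinDRSite

Anything listed under ServicesNotRunning needs to be started before you continue.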

The simplest approach is to move all active databases from the primary site and activate them in the DR site. The following commands should be run on a server in the DR site, most likely on the Exchange server itself.

First, remove the activation block on the mailbox database copies in the DR site:

Resume-MailboxDatabaseCopy -Identity 'Mailbox Database Name\FQDNofaServerinDRSite'

Perform this step on every mailbox database you want to activate. There is a chance that databases will mount automatically when you resume the mailbox database copies. You can verify the status by running Get-MailboxDatabaseCopyStatus on the Exchange server in the DR site.
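If you have many databases, you don’t have to type the command once per database. A sketch that resumes every copy hosted on the DR server in one pipeline (substitute your own server name):

Get-MailboxDatabaseCopyStatus -Server FQDNofaServerinDRSite | Resume-MailboxDatabaseCopy

If the pipeline gives you trouble, falling back to running Resume-MailboxDatabaseCopy per database as shown above always works.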

Get-MailboxDatabaseCopyStatus -Server FQDNofaServerinDRSite | fl Name, Status, ActivationSuspended, ContentIndexState, ActiveCopy

If the databases are mounted and ActiveCopy is True, you are done with the activation, and Outlook should now be able to connect and start sending and receiving mail internally. Next, reconfigure services and applications to make Exchange reachable from the Internet via SMTP, Outlook Anywhere, OWA, ActiveSync, etc. If you have ISA or another reverse proxy server, point it to the server in the DR site instead of the server in the primary site. Other things that might need to be reconfigured are Autodiscover and the InternalUrl on several IIS virtual directories.
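The URL updates can be done from the Exchange Management Shell. A hedged sketch, assuming a DR CAS named DRCAS01 and that mail.example.com now resolves to the DR site (hypothetical names — adjust to your own namespace):

Set-OwaVirtualDirectory -Identity 'DRCAS01\owa (Default Web Site)' -InternalUrl 'https://mail.example.com/owa'

Set-ActiveSyncVirtualDirectory -Identity 'DRCAS01\Microsoft-Server-ActiveSync (Default Web Site)' -InternalUrl 'https://mail.example.com/Microsoft-Server-ActiveSync'

Set-ClientAccessServer -Identity DRCAS01 -AutoDiscoverServiceInternalUri 'https://mail.example.com/Autodiscover/Autodiscover.xml'

If you keep the same hostnames in both sites and only repoint DNS, most of these values will already be correct, which is one good argument for a unified namespace.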

If the databases don’t mount correctly, you can run the following command manually:

Move-ActiveMailboxDatabase –Server FQDNofaServerinPrimarySite –ActivateOnServer FQDNofaServerinDRSite

Depending on how Windows and Exchange handled the crash, you might encounter some errors that make the activation a little more difficult. For example, the content index might not be up to date on the DR server, or not all transaction log files might have been copied to it. The solution is to specify some extra parameters on the Move-ActiveMailboxDatabase command.

For example, -SkipClientExperienceChecks is useful when the content index is not up to date.

If you have not configured AutoDatabaseMountDial on the mailbox server, it is set to Lossless by default. Since there is always a chance that replication has not copied all transaction log files to the DR server, you may then have to use -MountDialOverride with a value such as BestAvailability or GoodAvailability.

Other parameters that might be needed are -SkipLagChecks and -SkipHealthChecks.
You might have to combine several parameters to get the databases up and running:

Move-ActiveMailboxDatabase –Server FQDNofaServerinPrimarySite –ActivateOnServer FQDNofaServerinDRSite –MountDialOverride:BestAvailability –SkipLagChecks –SkipHealthChecks -SkipClientExperienceChecks

More information about Move-ActiveMailboxDatabase is found on TechNet: http://technet.microsoft.com/en-us/library/dd298068.aspx

When you have replaced the motherboard on the Exchange server in the primary site and replication starts flowing from the DR site back to the primary site, you’re good, and it’s time to plan the switchover back to the primary site. This is done with the same steps as above. Plan the switchover for off hours, since it will take a couple of minutes due to the necessary DNS updates, AD replication, and the time it takes to run the commands above.

Finally, you should run Suspend-MailboxDatabaseCopy again to disable automatic activation of the databases in the DR site:

Suspend-MailboxDatabaseCopy -Identity 'Mailbox Database 2036433681\FQDNofServerInDRSite' -ActivationOnly –Verbose

This last step is needed because the activation block is reset when you do a switchover between servers. Be sure to remember to do this for every mailbox database on your servers.
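To avoid missing a database, you can script the activation block for every copy on the DR server. A sketch, assuming the server name below is replaced with your own:

Get-MailboxDatabase -Server FQDNofServerInDRSite | ForEach-Object { Suspend-MailboxDatabaseCopy -Identity "$($_.Name)\FQDNofServerInDRSite" -ActivationOnly }

Running this as one loop after every switchover is an easy habit that prevents an unwanted automatic failover to the DR site later.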

If you can’t get things started on the Exchange server in the primary site due to a corrupt database or transaction log files, you might have to reseed the files from the server in the DR site. Use the Update-MailboxDatabaseCopy cmdlet, possibly with the -DeleteExistingFiles parameter. (Update-StorageGroupCopy is the Exchange 2007 cmdlet; storage groups no longer exist in Exchange 2010.)
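A reseed might look like the following, run against the now-passive copy in the primary site (the database and server names are placeholders):

Update-MailboxDatabaseCopy -Identity 'Mailbox Database Name\FQDNofServerInPrimarySite' -DeleteExistingFiles

Note that a reseed copies the entire database over the network, so on a slow WAN link this can take a long time for large databases.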

Recovering from a disk failure is pretty much the same as above, but it only involves the databases and transaction log files located on the faulty disk.

Another cool thing is that you can even test a database switchover in production. To do this, first create a database in the primary site and make a copy in the DR site the same way all the other databases were created. Next, create a mailbox in the test database, log on, and send some test messages back and forth. Activate the test database on the DR server, edit the hosts file on your test client with the FQDN of the CAS array name and the IP of the Exchange server in the DR site, and start Outlook again. You should now be able to connect with Outlook to the DR server and use Outlook the normal way without disturbing any other users.

Recovering from a disaster in the primary site.
This is a more problematic scenario, but the steps are basically the same as above. The extra complexity comes from the fact that you don’t have any servers or network connectivity in the primary site, so your cluster will not have access to its quorum and will therefore be in a failed state.

How do you solve this problem?
First you need to get your cluster working again.
In the DR site, stop the Failover Cluster service if it is running, and then start it again with the /forcequorum switch:

net stop clussvc

net start clussvc /forcequorum

The next step is to activate all databases on the DR server. This is done with the Move-ActiveMailboxDatabase command, the same way as before.

You may also have to manually mount the databases.

With a complete site failure in the production site, you will most likely need to live with the DR site for a while, which calls for more actions than just getting your Exchange server up and running. You also need to get traffic to and from the Internet flowing, both mail flow and user access to Exchange. Autodiscover is your friend for updating the configuration in Outlook, so make sure you have configured all URLs correctly.
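To verify what Autodiscover URL your clients will be handed, you can list it per Client Access server from the shell:

Get-ClientAccessServer | fl Name, AutoDiscoverServiceInternalUri

If the URI still points at a hostname that resolves to the dead primary site, fix it with Set-ClientAccessServer before users start Outlook.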

So, on the whole, there is a lot more to reconfigure than just Exchange when doing a site failover.

http://technet.microsoft.com/en-us/library/dd351049.aspx

How do you fail back to your primary site after the disaster?

We have forced quorum on our cluster, and if we restart the cluster service or reboot the server, the cluster service will again fail to get quorum. This matters when servers come online in the primary datacenter, since we don’t want to still be running with a forced quorum in the secondary site when the servers in the primary site start up.

If things weren’t that bad and we can simply power everything up in the primary site, replication should start working again.
But you still have to do a few things: reconfigure your File Share Witness, restart the cluster service on the secondary Exchange server, and basically repeat all the steps we did to move everything to the secondary site, but this time pointing everything back to the primary site. Don’t rush things here; let Active Directory get to a stable state first, and then slowly move things back to normal.

Depending on what state the servers are in and what actually happened, you may not want to start Exchange in the primary site at all, but instead remove it from the DAG, rebuild Exchange, rejoin it to the DAG, and so on.

As you have probably noticed, there are lots of variables, and therefore it is not an easy task to write a step-by-step guide covering every situation. I recommend writing down the basic steps and your configuration information to make the transition easier when you are dealing with the stress of the situation. The best tip I can give you is to learn how things work and play with the various scenarios in a lab. The experience you gain from this will be your best friend when the unexpected happens in real life.