If you want to make an Active Directory administrator uncomfortable, ask them about their recovery plan.
When you ask this question, many AD admins will instead tell you about their object recovery plan. Some will describe their domain controller recovery procedures. But if you press further and ask whether they’ve built a plan for recovering a domain or forest, fewer than one in ten will admit to it. Far fewer will say they’ve actually tested it.
Why do so few organizations put together and test a disaster recovery plan for what is unquestionably one of a company’s most critical pieces of software plumbing?
There are two major reasons. First, AD is extremely reliable as a core infrastructure service. It’s a distributed application that runs across multiple instances, and the failure of one or more of these instances won’t stop the service. In a properly maintained AD forest, domain or forest failure (without outside interference) is a rare occurrence.
Second, if you do have an Active Directory catastrophe, meaning the loss of a domain or forest, recovery from it is a decidedly non-trivial task. Microsoft’s Planning For Active Directory Forest Recovery document is 47 pages long, and it’s far from an operational procedure. It’s a high-level procedure and set of guidelines that you must extensively customize for your environment. You can judge the seriousness of this document from what it assumes: that even before you begin anything outlined in it, you’ve been talking to Microsoft support to determine the root cause and to get advice on whether you should proceed at all.
Don’t let these obstacles deter you. Let me be clear: If you don’t build, test, and maintain an Active Directory disaster recovery plan, you aren’t doing your job. If you still hesitate, just imagine being called into some plush conference room after a DR event, where your manager and every manager up to the CIO will ask you why you didn’t plan for this contingency. It’s the classic “resume-generating event”.
At a high level, domain or forest recovery has the following steps:
- Figure out what caused AD to break. If you don’t know why it broke, you won’t be able to do the next step, and you’ll have no choice but to rebuild your forest from scratch rather than recover it. Most companies never recover from this.
- Identify a backup, taken before AD broke, that it can be recovered from. Key point: If you don’t have a good backup strategy and process, you’re stuck in the rebuild scenario.
- Recover a “seed” domain or forest of one DC for each domain, on an isolated network.
- Connect the seed to the network and recover existing DCs into the seed.
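To make the second step concrete, here is a minimal Python sketch of how you might pick a recovery point for each domain once you know when AD broke. It is an illustration only, not part of Microsoft’s procedure: the `Backup` record, the `pick_seed_backups` function, and the domain and DC names are all hypothetical, and the sketch assumes you already keep an inventory of your DC backups with timestamps.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class Backup:
    """Hypothetical inventory record for one domain controller backup."""
    domain: str
    dc_name: str
    taken_at: datetime


def pick_seed_backups(backups: Iterable[Backup], broke_at: datetime) -> Dict[str, Backup]:
    """For each domain, return the most recent backup taken before AD broke."""
    chosen: Dict[str, Backup] = {}
    for backup in backups:
        # A backup taken at or after the failure may already contain the damage.
        if backup.taken_at >= broke_at:
            continue
        current = chosen.get(backup.domain)
        if current is None or backup.taken_at > current.taken_at:
            chosen[backup.domain] = backup
    return chosen


# Hypothetical inventory: two domains, failure identified on 10 March.
inventory = [
    Backup("corp.example", "DC1", datetime(2024, 3, 1)),
    Backup("corp.example", "DC2", datetime(2024, 3, 8)),
    Backup("corp.example", "DC1", datetime(2024, 3, 12)),
    Backup("child.corp.example", "DC5", datetime(2024, 3, 11)),
]
seeds = pick_seed_backups(inventory, broke_at=datetime(2024, 3, 10))
```

A domain that is missing from the result has no backup that predates the failure, which is exactly the rebuild scenario described above. A real selection would also need to check that each backup is actually restorable, which this sketch does not attempt.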
There’s a lot of work required within this high-level process. Lock yourself in a room and think deeply through it; imagine yourself working through this scenario in the real world. There are also lots of gotchas in the recovery process. For example, when you attach the seed forest to the corporate network, you must still keep it isolated from users while you’re building out its capacity. Otherwise, every user in the company will try to log on to the network at once, the seed forest DCs will melt down, and you won’t be able to add others.
Once you’ve customized the procedure in the document, you must build a reasonably sophisticated test lab and test your procedure to refine it and uncover other problem areas you won’t have thought of in the planning process. This is a project that takes weeks of work. It’s enough of an undertaking that Microsoft offers it as part of ADRES (AD Recovery Execution Service), a 5-day service engagement that focuses on AD recovery issues.
So yes, Active Directory disaster recovery planning is a daunting task. But you must do it. If you’re an IT manager, ask your AD team if they have a recovery plan. If they don’t, help fund or prioritize it for them! I like to think of AD DR the same way we think of insurance: You hope you’ll never need it, and most don’t. But if you need it, you really need it.