Active Directory is a very robust application, as it should be for such a fundamental building block of a company’s IT infrastructure. But the architecture that makes it robust also makes it hard to understand. This lack of understanding often leads to assumptions in your recovery strategy that can leave your AD broken without a way to get back to normal.
When the AD engineering and operations team builds its recovery plan, the natural tendency is to first think of recovering domain controllers (DCs). This is a good first step, and it’s rooted in two very practical reasons. First, DCs have historically been the AD component that fails the most. Second, recovering and rebuilding servers is one of the oldest operational tasks in the operations run book, and so naturally you’d apply it to this situation.
But in Active Directory’s case, the team often doesn’t continue and plan for the other disaster type that Active Directory can encounter: logical corruption of AD data.
The web of replication
Part of what makes Active Directory hard to understand and troubleshoot operationally is the complexity of its logical data model. I’ve always looked at AD through what I think of as several pairs of glasses. First and most obvious is the network of physical DCs and their health. We tend to think of a forest’s DCs first because they are the tangible assets of Active Directory. Next are my Sites glasses. With these, I view the site topology of the corporate network and how data is constantly replicated between DCs: sites of good connectivity (which may or may not have a DC in them), site links that connect the sites in the way that best fits the needs of the users and applications that rely on AD sites, and the fine-tuning of these site links with link costing and DNS priority to favor some DCs over others.
But reality is more complicated than a site architecture diagram. If you look closely at replication with a tool such as ADREPLSTATUS or Repadmin, the true complexity of Active Directory is revealed. Every DC hosts several directory partitions containing important data, and different forest designs influence what partitions a DC hosts. Every one of those partitions connects with several other DCs elsewhere in the domain or forest. I see this as a skein of hundreds of pieces of multicolored yarn connecting pegs in a pegboard. Some colors of string go to just a few pegs, while other colors go to every peg on the board. It’s a marvel it works at all, let alone that it works so well.
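The pegboard picture can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the DC names and domain names are hypothetical, and it ignores application partitions and the partial replicas held by global catalogs. It shows which "colors of yarn" reach every DC (the forest-wide Schema and Configuration partitions) and which reach only a few (each domain partition).

```python
from collections import defaultdict

# Hypothetical forest: DC name -> the domain it belongs to.
DCS = {
    "DC1": "corp.example.com",
    "DC2": "corp.example.com",
    "DC3": "emea.corp.example.com",
    "DC4": "emea.corp.example.com",
}


def partitions_hosted(dcs):
    """Map each directory partition to the set of DCs that hold a replica of it."""
    hosts = defaultdict(set)
    for dc, domain in dcs.items():
        # Schema and Configuration are forest-wide: every DC holds them.
        hosts["Schema"].add(dc)
        hosts["Configuration"].add(dc)
        # A domain partition is held only by the DCs of that domain.
        hosts[f"Domain:{domain}"].add(dc)
    return dict(hosts)


if __name__ == "__main__":
    for partition, holders in sorted(partitions_hosted(DCS).items()):
        print(f"{partition}: {sorted(holders)}")
```

On a real forest, the authoritative view of partitions and replication partners comes from a tool such as Repadmin rather than from a model like this.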
But this mechanism works just as well for bad data as it does for good data. In a case of logical data corruption, the corrupted data is replicated to all DCs across the domain or forest. Where the logical corruption is severe (for example, disabling the DC’s account in AD through the standard Active Directory Users and Computers console), a disaster recovery of AD needs to be carried out. The Microsoft restoration process includes more than 50 manual steps, making it slow, error-prone and insufficient for enterprise requirements. The complexity of the procedure is high enough for Microsoft to offer it as part of ADRES (AD Recovery Execution Service), a 5-day service engagement that focuses on AD recovery issues.
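To make the point about bad data concrete, here is a minimal simulation sketch. It assumes a hypothetical four-DC topology with made-up names and treats replication links as two-way for simplicity. It is not how AD schedules replication. It only illustrates that a change made on one DC eventually reaches every DC that shares the partition, whether the change is good or bad.

```python
from collections import deque

# Hypothetical replication partners for a single directory partition.
# Links are modeled as two-way; real connection objects are one-way.
PARTNERS = {
    "HQ-DC1": ["HQ-DC2", "BRANCH-DC1"],
    "HQ-DC2": ["HQ-DC1", "DR-DC1"],
    "BRANCH-DC1": ["HQ-DC1"],
    "DR-DC1": ["HQ-DC2"],
}


def spread(partners, origin):
    """Return {dc: number of replication hops before the change arrives}."""
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        dc = queue.popleft()
        for peer in partners[dc]:
            if peer not in hops:  # first arrival is the shortest path
                hops[peer] = hops[dc] + 1
                queue.append(peer)
    return hops


if __name__ == "__main__":
    # A corrupting change is made on a branch-office DC.
    for dc, n in spread(PARTNERS, "BRANCH-DC1").items():
        print(f"{dc}: corrupted after {n} hop(s)")
```

In this sketch the DC in the disaster recovery site is the last to receive the change, but it still receives it. Nothing in the replication path distinguishes a corrupting change from a legitimate one.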
Many companies have implemented disaster recovery (DR) sites in Active Directory to provide failover DCs for users and services if their usual DC fails. However, neither DR sites nor the number of DCs you have will help you in the case of logical corruption. They actually make recovery more complicated, because every additional DC has received the same corrupted data.
At that point, you need either a documented manual procedure to follow or an automated solution that carries out the recovery for you.