December 04, 2012

Active Directory Replication

Active Directory replication is key to the health and stability of an Active Directory environment. Without proper and timely replication, a domain will be unable to function effectively. Replication is the process of sending update information for data that has changed in the directory to other domain controllers. It is important to have a firm understanding of replication and how it takes place, both within the domain and in multiple-site environments.

There are three main elements or components that are replicated between domain controllers: the domain partition replica, the global catalog and the schema.

The domain partition replica is the Active Directory database of a domain. Each domain controller maintains a duplicate copy of its local domain partition replica. Domain controllers do not maintain copies of replicas from other domains. When an administrator makes a change to the domain, that change is replicated to all domain controllers immediately.

Each forest contains only a single global catalog. By default, the first domain controller installed into a forest is the global catalog server. The global catalog contains a partial replica of every object within each domain of the forest. The global catalog serves as a master index for the forest, which allows for easy and efficient searching for users, computers, resources and other objects. Any domain controller can be configured to act as a peer global catalog server. You should have at least two global catalog servers per domain and at least one per site. As changes are made to objects within the forest, the global catalog is updated. Once the global catalog is changed on one domain controller, it is replicated to all other domain controllers in the forest.

Every domain controller in a forest has a copy of the schema. Just as with changes to the Active Directory database (i.e., domain partition replica), any changes to the Active Directory schema are replicated to all other domain controllers in the forest. Fortunately, the schema is usually static so there is little replication traffic caused by schema changes.

Multi-master replication

Within Windows-based Active Directory domains, each domain controller is a peer server. Each domain controller has equal power and responsibility to support and maintain the Active Directory database. It is this database that is essential to the well-being and existence of the domain itself. This is such an important task that Microsoft elected to make it possible to deploy multi-redundant systems to support Active Directory by making each domain controller a peer.

Whenever a change occurs to any object within an Active Directory domain, that change is replicated automatically to all domain controllers within the domain. This process is called multi-master replication. Multi-master replication does not happen instantly across all servers simultaneously. Rather, it is a controlled process where each domain controller peer is updated and validated in a logically controlled procedure.

As an administrator, you have some control over how multi-master replication occurs. Most of your control is obtained through the use of sites. A site is a logical designation of domain controllers in a network that are all located within a defined physical area. In most cases, sites control traffic over high-expense low-bandwidth WAN links. When a domain exists on two or more sites, normal Active Directory replication between the domain controllers in different sites is terminated. Instead, a single server within each site, labeled as a bridgehead server, performs all replication communications. You can configure this bridgehead server for when replication is allowed to occur and how much traffic it can generate when performing replication.

You can use sites to control replication even if you do not employ WAN links on your network. Sites effectively give administrators control over how and when AD multi-master replication occurs within their network.

Active Directory replication topology design

One of the secrets to an efficient and error-free Active Directory infrastructure is a well-designed replication topology. While this can be easy to design in a simple network, a large, complex network presents a challenge. Designing the AD topology efficiently is to construct it so that it takes advantage of the strengths and minimizes the weaknesses of the network. In a complex network, you are likely to have a number of different link speeds connecting remote sites.

The best practices for Active Directory replication design include:
  1. Design the AD topology to take advantage of the network topology and link speeds.
  2. Define lower speed links with higher cost site links. The cost of the links reduces as you get to faster areas in the topology.
  3. Avoid "dead spots" -- all sites must connect to each other eventually. I have seen some topologies that left certain sites isolated because they didn't design the site links to connect them.
  4. Site links should only have two sites per link. The exception to this is the Core site link which can have more. Defining more than two sites per link can result in unpredictable results when a DC failure occurs.
  5. Diagram the overall flow of replication (like the figures here). You can use sophisticated features available in tools like HP OpenView (see the example in Figure 3) or Microsoft MOM, or you can simply draw it in a PowerPoint slide as I did in Figure 2. You'd be surprised at how many errors you will find by making a drawing of the topology.
  6. Don't define scheduling unless you really have a good reason, and then you should test it thoroughly. Since you can schedule replication over the site link as well as the connection object itself, and since the resultant replication schedule is a merge of the two, you can end up with a schedule that prohibits replication. You also define replication frequency, which further complicates it. For instance, if you schedule the site links to replicate Monday through Friday from 8 a.m. to 6 p.m., and then have some connection objects that only replicate Tuesday and Thursday from 6 p.m. to 10 p.m., those connection objects will never replicate. Unless you have a very slow or limited network (such as VPN links), you should avoid this level of manual intervention.
  7. Run the AD in Windows 2003 Forest mode. This means all DCs are at Windows Server 2003, and all domains are running in Windows 2003 mode. This takes advantage of the new spanning tree and compression algorithms available in Windows Server 2003, as well as other features that make replication much more efficient than were available in Windows 2000.
  8. Monitor the AD. Once you get it in place, monitor it. One of the easiest ways to monitor it, outside of using Microsoft or third-party tools, is using the Repadmin tool and its "Replsum" option: Repadmin /replsum /bydest /bysrc /sort:delta. This will provide a nice, neat table of all DCs in all domains in the forest, telling you how long it has been for outbound and inbound replication (i.e. where each DC appears as a source and destination). Watching this over several days will give you a chance to find any holes in the topology.
Troubleshooting Active Directory replication

Replication should occur automatically. When it doesn't, the best solution isn't just to force Active Directory replication, but to check out the topology. If the replication topology has become unstable or misconfigured, it needs to be corrected before initiating a manual replication procedure.

The Knowledge Consistency Checker (KCC) creates the replication topology used for intra-site replication automatically. Rather than creating a full mesh for replication, the KCC designs a topology where every DC has at least two replication partners and is no more than three hops away from any other DC. With such a topology, every DC can be fully updated with as little as three replication cycles.

The REPAdmin tool from the Windows Support Tools and Resource Kit can be used to check the topology. The command "repadmin /showreps" runs on a domain controller and produces a list of replication partners as designated by the KCC. To check the topology, verify that every DC lists at least two replication partners and that all named partners see each other as partners. For example, if Server A lists Server B and C as partners, then both Server B and C should list Server A in return as a partner. If you discover a problem or inconsistency in the topology, use the KCC to regenerate the topology.

Once you are sure the topology is correct, then and only then should you force Active Directory replication.

Debugging replication errors

The lion's share of Active Directory problems are to some degree caused by replication failures, and one of the most notorious replication errors is the Event ID 1311.

The first step in resolving this replication error is to determine the scope of the error. The easiest way to do this is with the Repadmin/Replsum command. This will give you a complete summary of all the DCs in the forest, including the relevant event ID if it is in an error state. The general form of the command is this:

Repadmin /Replsum /bysrc /bydest /sort:delta

Here is a sample output of this command. Note that there are four domain controllers failing replication. While the 1311 may not show up in the output of this command, it is common for it to be paired up with the 1722 event (which basically means no physical connectivity). Obviously, if there is no physical connectivity (which would mean there was a network failure), replication isn't going to happen. The first thing to do is to check the general health of the domain using the Repadmin /replsum command just described. You can also ping broken DCs by address and FQDN, and you can run NetDiag and DCDiag commands from the command line (with the /v switch on each). This will give you more details about the errors and perhaps related ones.

Note: The network connecting all the sites should be fully routed. Don't create a site link if there is no underlying network link to get between the sites in the site link.

Logical connectivity is a bit more difficult to diagnose. It means, bottom line, that something in the AD site topology configuration is wrong, creating a hole in the topology. This could be solved by one of the following actions: configuring a preferred bridgehead server, making sure all sites are defined in site links and making sure there is a complete mesh of sites in site links.

DNS also must be taken into account. Since Active Directory replication relies on DNS name resolution to find DCs to replicate with, if DNS is broken, it could cause the 1311 events to occur. The helpful thing here is that if DNS is the culprit, the 1311 event will have the phrase "DNS Lookup Failure" included in the description. If you see this phrase, then you absolutely, positively have a DNS problem that must be fixed.

When debugging 1311 events, you should get a scope of the entire forest to see which DCs are not replicating. You can do this easily using the Repadmin /Replsum command. Note that the loss of physical connectivity, an incomplete AD site topology or DNS failure usually cause these events, with an outside chance it will be an orphaned object (an object that connot be found in the directory tree). Usually, other events will accompany them, such as the 1722 (RPC Server Unavailable), or the event will contain a descriptive statement such as "DNS Lookup Failure." This is a critical event that must be resolved in order for Active Directory replication to function properly to all DCs.



No comments:

Post a Comment