Lesson: The danger of assuming something is happening when you don't know conclusively is that you send your entire investigation off in the wrong direction.
Sort of resolved. So we fixed the issue of groups not resolving for parent domain users. What about child domain users trying to resolve child domain groups? If I remove the child domain GCs from the list, does that prevent child domain group lookups? Will investigate.
So Outlook definitely uses round robin (RR) to get a GC. You can see this by CTRL+left-clicking the Outlook icon in the system tray, then choosing 'connection status'. The server next to 'directory' is the GC in use. If you close Outlook and re-open it, you'll see this change as it RRs among the available GCs. In my testing, though, I could not get it to pick a child domain GC. ACTUALLY...if you just click the 'reconnect' button, it cycles through the GCs available to you. Nice!
Option 1: Use AD Sites. Since the parent and child domains are geographically separate, this could mitigate the issue to a degree. However, we'd have to figure out how the datacentre locations would come into play - I need to research how sites actually work in this situation. In theory this should prevent Outlook's RR from picking GCs in the wrong site.
Option 2: Remove child domain GCs from the 'Directory Access' tab. This may impact child domain Outlook users - must verify this first. (Can parent domain GCs do lookups into the child domain? Is a special trust required?)
Option 3: There is a registry option to force Outlook to use a local GC for lookups. Not a good option.
Ok, here's what's happened:
- We had GC issues with Exchange a few weeks ago - this was because the GCs were manually configured, and even then only one GC was available to Exchange.
- I changed that to 'auto discover' DCs. That resolved the issue at the time.
- List of available DCs/GCs now includes child domain's DCs.
- When clients load Outlook, it round-robins for a GC to use.
- Occasionally it RRs and gets a child domain DC.
- Child domain DCs have no knowledge of group membership in globalive.local.
- Email to group fails at ‘Categorizer’ stage. (Categorizer resolves group names into individual email addresses.)
Lesson? Do NOT set 'Directory Access' to automatically discover when you have child domains.
Boomshakalaka. Feels good to figure it out!
Or at least, that's what I think is happening here.
- Environment has a parent domain and child domain.
- Exchange server is for both domains.
- Almost all DCs across both domains are GCs.
- Group in question is a global security group with members only from the parent domain (and no visible stale membership).
- 'Directory Access' tab in Exchange server properties lists all GCs in all domains (set to automatically discover GCs).
We occasionally have this issue:
- User sends an email to a group from Outlook.
- Email never reaches the group.
- Mail tracking indicates the mail has arrived at the server.
- Mail tracking last event: SMTP: Message Submitted to Categorizer
Some googling revealed that the 'Categorizer' is responsible for talking to a GC and getting group membership information.
What I think is happening stems from the list of GCs...the Categorizer goes to look up group membership and happens to choose one of the child domain GCs. That GC would not have knowledge of the group membership of the parent domain, and so the mail gets stuck and is never sent.
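To make the theory concrete, here is a minimal Python sketch of it. Everything in it is hypothetical: the GC names, the member addresses, and the use of a random pick as a stand-in for however the GC actually gets chosen. It only illustrates why the failure would look intermittent, not how Exchange really behaves.

```python
import random

# Hypothetical view of what each GC knows about one parent-domain global group.
# None means "this GC has no membership data for the group".
GROUP_MEMBERS_BY_GC = {
    "parent-gc1": ["alice@parent.example", "bob@parent.example"],
    "parent-gc2": ["alice@parent.example", "bob@parent.example"],
    "child-gc1": None,  # child-domain GC (assumed to lack the membership)
}


def expand_group(gc):
    """Return the group's member addresses, or raise if this GC cannot expand it."""
    members = GROUP_MEMBERS_BY_GC[gc]
    if members is None:
        raise LookupError(f"{gc} has no membership data for the group")
    return members


def send_to_group(gc):
    """Simulate the categorizer step using the given GC."""
    try:
        return f"delivered to {len(expand_group(gc))} recipients"
    except LookupError:
        return "stuck at categorizer"


def failure_rate(trials=10_000, seed=0):
    """Fraction of sends that get stuck when the GC is picked at random."""
    rng = random.Random(seed)
    gcs = list(GROUP_MEMBERS_BY_GC)
    failures = sum(
        send_to_group(rng.choice(gcs)) == "stuck at categorizer"
        for _ in range(trials)
    )
    return failures / trials
```

With one child-domain GC out of three, roughly a third of the simulated sends get stuck. In a real environment the ratio would depend on how many GCs of each kind are in the list and on how the selection actually works.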
There is one alternative: Orphaned user objects. However, the group in question is clear of old/disabled users, and there are no 'mystery GUIDs'.
At least that's all I can come up with. I've enabled the LDAP/directory logging, so we'll see what that turns up.
More info on the categorizer: http://searchexchange.techtarget.com/tutorial/Part-3-How-the-SMTP-categorizer-works-in-Exchange-Server-and-Active-Directory
And this nice tidbit:
"If your Exchange Server organization is having mail flow issues, use message tracking to see where the process is breaking down. If messages are stopping at the categorizer, you should begin troubleshooting the problem by looking at directory-access issues."
And for the random issues, the messages in question are being stopped at the categorizer.
With logging turned on for MSExchangeTransport, I am seeing categorizer 6020 events (...using LDAP server Host: "DCname") - indicating that the DSAccess service is polling the AD topology successfully. There are events for each DC in the 'Directory Access' tab. This indicates that there are no issues with discovering DCs. (This can also be tested by just viewing the 'Directory Access' tab - it should not take more than a few seconds to resolve the list. If it sits there for more than 30-40 seconds, it's probably timing out on one or more DCs.)
Since the error is difficult to troubleshoot (very rare, spread out), the only logical next step is to try and prove the theory - which GC does the Categorizer use, and in what order?
Nice TechNet article on how Exchange 2003 uses LDAP connections: http://technet.microsoft.com/en-us/library/aa996247%28EXCHG.65%29.aspx#LDAPConnectionLoadBalancingAnd
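If the event text is exported somewhere, the "one event per DC" check could be scripted rather than eyeballed. This is only a sketch: it assumes each message contains the `Host: "DCname"` fragment quoted above, and the function names and sample DC names are made up for illustration.

```python
import re

# Assumes each 6020 message contains a fragment like: Host: "DCname"
HOST_RE = re.compile(r'Host:\s*"([^"]+)"')


def hosts_seen(messages):
    """Collect every host name mentioned in the event messages (lower-cased)."""
    return {m.group(1).lower() for msg in messages for m in HOST_RE.finditer(msg)}


def missing_dcs(expected, messages):
    """Return the expected DCs that never show up in any event message."""
    seen = hosts_seen(messages)
    return sorted(dc for dc in expected if dc.lower() not in seen)


# Hypothetical sample: DC2 is listed in 'Directory Access' but has no event.
sample_messages = [
    '...using LDAP server Host: "DC1"',
    '...using LDAP server Host: "dc3"',
]
print(missing_dcs(["DC1", "DC2", "DC3"], sample_messages))
```

An empty result would match what I saw: every DC in the list has at least one event.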
Ok, we have an answer: http://technet.microsoft.com/en-us/library/cc751317.aspx
This doc explains things in detail. DSProxy is what actually talks to the GCs when doing lookups, and it uses load balancing. (emphasis mine)
"Although NSPI is a very efficient process, the DSProxy process uses a load balancing mechanism to ensure that client requests are divided equally among all available global catalog servers.
When a MAPI client contacts NSPI Proxy, the IP address of the requesting client is hashed against the number of available global catalog servers. DSProxy uses the result to either proxy or refer the client to one of the global catalog servers. This load balancing method enables the client to contact the same global catalog server, thus ensuring consistency. The Directory Service Referral interface (RFRI) uses a different load balancing mechanism; when a client connects to RFRI, global catalog servers are returned in round robin fashion."
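The difference between the two mechanisms in that quote is easier to see side by side. The sketch below is an illustration only: the doc does not say which hash DSProxy uses, so taking the client IP as an integer modulo the number of GCs is my assumption, and the GC names are hypothetical.

```python
import ipaddress
from itertools import cycle

# Hypothetical list of available global catalog servers.
GCS = ["parent-gc1", "parent-gc2", "child-gc1"]


def pick_by_client_hash(client_ip, gcs):
    """NSPI-proxy style: the same client IP always maps to the same GC.

    The real hash is not documented in the quote; integer-IP modulo the
    number of GCs is used here purely as a stand-in.
    """
    return gcs[int(ipaddress.ip_address(client_ip)) % len(gcs)]


def round_robin(gcs):
    """RFRI style: each new connection is handed the next GC in the list."""
    return cycle(gcs)


# Hash-based: repeated lookups from one client stay on one GC.
print(pick_by_client_hash("10.0.0.5", GCS), pick_by_client_hash("10.0.0.5", GCS))

# Round robin: successive connections walk through every GC, child domain included.
rr = round_robin(GCS)
print([next(rr) for _ in range(4)])
```

With the hash-based method a given client would either always work or always fail. With round robin, any client can land on any GC from one connection to the next, which fits a problem that comes and goes.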
It appears that Outlook 2000 and up use RFRI to connect (i.e. they get their GC via round robin). The GC is refreshed when the client starts up.
Summary: Ok, so that's good enough for me - I can test this next time - if a client is finding that emails to a certain group are failing, I can ask them to close Outlook, re-open, and try re-sending the mail. I will update when this comes up again (if it does...we're upgrading to 2010 shortly).
Update: Yup! 99.9% sure this was the issue. I spoke to the two people who were having the issue, and got one to send a test (it failed), and then close Outlook and test again. It worked this time! He got the other guy to test (failed), then close/re-open Outlook, and voila, worked. Cool!