
Weird issue with Exchange Global groups

Update: OK, the post below contains erroneous conclusions. Clients are NOT using child domain GCs. I am becoming more convinced this is local to the Exchange server, as the child domain users are seeing their GCs (and can cycle through them with 'Reconnect') with no issues. I observed a user here having the issue, and noted that their Outlook 2007 Connection Status window was not showing any directory entries. However, when it started working for them after a while (15-45 minutes?), they still did not see any directory entries. Going to start from scratch in a new post.

Lesson: The danger of assuming something is happening when you don't know conclusively is that you send your entire investigation off in the wrong direction.


Sort of resolved. So we fixed the issue of parent domain users' groups not resolving. What about child domain users trying to resolve child domain groups? If I remove the child domain GCs from the list, does that prevent child domain group lookups? Will investigate.

So Outlook definitely uses round robin (RR) to pick a GC. You can see this by holding CTRL and clicking the Outlook icon in the system tray, then choosing 'Connection Status'. The server listed next to 'Directory' is the GC in use. If you close and re-open Outlook, you'll see it change as Outlook round-robins among the available GCs. In my testing, though, I could not get it to pick a child domain GC. ACTUALLY... if you just click the 'Reconnect' button, it cycles through the GCs available to you. Nice!

Option 1: Use AD Sites. Since the parent and child domains are geographically separate, the issue could be mitigated to a degree. However, we'd have to figure out how the datacentre locations would come into play, and I'll need to research how sites actually work in this situation. In theory this should prevent Outlook's round robin from picking GCs in the wrong site.

Option 2: Remove child domain GCs from the 'Directory Access' tab. This may impact child domain Outlook users, so I must verify that first. (Can parent domain GCs do lookups into the child domain? Is a special trust required?)

Option 3: There is a registry option to force Outlook to use a local GC for lookups. Not a good option.

---------------------

RESOLVED:

Ok, here's what's happened:
  • We had GC issues with Exchange a few weeks ago. The cause was that the DCs/GCs were manually configured, and even then only one GC was available to Exchange.
  • I changed that to 'auto discover' DCs, which resolved the issue at the time.
  • The list of available DCs/GCs now includes the child domain's DCs.
  • When clients load Outlook, it round-robins for a GC to use.
  • Occasionally the round robin lands on a child domain DC.
  • Child domain DCs have no knowledge of groups in globalive.local.
  • Email to the group fails at the 'Categorizer' stage. (The Categorizer resolves group names into individual email addresses.)
Resolution: Change the 'Directory Access' tab back to manual and add all parent domain DCs to the list.
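
Building that manual list means splitting the discovered GCs into parent domain and child domain servers. Here is a minimal sketch of that filtering, assuming each DC's DNS suffix matches its AD domain. The host names and the child domain name child.globalive.local are made up for illustration; only globalive.local comes from this environment.

```python
def domain_of(fqdn):
    """Return everything after the host label, lowercased."""
    _host, _sep, domain = fqdn.partition(".")
    return domain.lower()


def parent_domain_gcs(gcs, parent_domain):
    """Keep only the GCs that sit directly in the parent domain."""
    parent = parent_domain.lower()
    return [gc for gc in gcs if domain_of(gc) == parent]


# Hypothetical auto-discovered list (host names are placeholders).
discovered = [
    "dc01.globalive.local",
    "dc02.globalive.local",
    "dc01.child.globalive.local",
]

manual_list = parent_domain_gcs(discovered, "globalive.local")
print(manual_list)
```

The comparison is against the full domain suffix rather than a simple "ends with" check, because every child domain name also ends with the parent's name.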

Lesson? Do NOT set 'Directory Access' to automatically discover when you have child domains.

Boomshakalaka. Feels good to figure it out!


-------------------------------
Or at least, that's what I think is happening here.

  • Environment has a parent domain and child domain.
  • Exchange server is for both domains.
  • Almost all DCs across both domains are GCs.
  • Group in question is a global security group with members only from the parent domain (and no visible stale membership).
  • 'Directory Access' tab in Exchange server properties lists all GCs in all domains (set to automatically discover GCs).

We occasionally have this issue:
  • User sends an email to a group from Outlook.
  • The email never reaches the group.
  • Mail tracking indicates the mail has arrived at the server.
  • Mail tracking last event: SMTP: Message Submitted to Categorizer
Mail never leaves the server.
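
The symptom boils down to "the last tracked event for a message is the categorizer submission". Here is a rough sketch of that check, assuming the tracking data has already been exported as (message ID, event text) pairs in chronological order. The message IDs and the placeholder later event are invented, and this is not the real message tracking log format.

```python
STUCK_EVENT = "SMTP: Message Submitted to Categorizer"


def stuck_at_categorizer(events):
    """Return IDs of messages whose last tracked event is the categorizer submission.

    events: iterable of (message_id, event_text) pairs, oldest first.
    """
    last_event = {}
    for message_id, event_text in events:
        # Later events overwrite earlier ones, so only the final event survives.
        last_event[message_id] = event_text
    return [mid for mid, event in last_event.items() if event == STUCK_EVENT]


# Hypothetical sample: msg-1 moves on after the categorizer, msg-2 does not.
sample = [
    ("msg-1", STUCK_EVENT),
    ("msg-2", STUCK_EVENT),
    ("msg-1", "(some later event, placeholder)"),
]

stuck = stuck_at_categorizer(sample)
print(stuck)
```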

Some googling revealed that the 'Categorizer' is responsible for talking to a GC and getting group membership information.

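What I think is happening stems from the list of GCs: the Categorizer goes to look up group membership and happens to choose one of the child domain GCs. That GC would not have knowledge of the group's membership in the parent domain (membership of a global group is held only in the group's own domain; only universal group membership is replicated to the global catalog), and so the mail gets stuck and is never sent.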

There is one alternative: Orphaned user objects. However, the group in question is clear of old/disabled users, and there are no 'mystery GUIDs'.

At least that's all I can come up with. I've enabled the LDAP/directory logging, so we'll see what that turns up.

More info on the categorizer: http://searchexchange.techtarget.com/tutorial/Part-3-How-the-SMTP-categorizer-works-in-Exchange-Server-and-Active-Directory

And this nice tidbit:

"If your Exchange Server organization is having mail flow issues, use message tracking to see where the process is breaking down. If messages are stopping at the categorizer, you should begin troubleshooting the problem by looking at directory-access issues."

And for the random issues, the messages in question are being stopped at the categorizer.

With logging turned on for MSExchangeTransport, I am seeing categorizer 6020 events (...using LDAP server Host: "DCname"), which indicates that the DSAccess service is polling the AD topology successfully. There is an event for each DC in the 'Directory Access' tab, so there are no issues with discovering DCs. (This can also be tested by just viewing the 'Directory Access' tab: it should not take more than a few seconds to resolve the list. If it sits there for more than 30-40 seconds, it's probably timing out on one or more DCs.)

Since the error is difficult to troubleshoot (very rare, spread out), the only logical next step is to try to prove the theory: which GC does the Categorizer use, and does it pick them in any particular order?

Nice TechNet article on how Exchange 2003 uses LDAP connections: http://technet.microsoft.com/en-us/library/aa996247%28EXCHG.65%29.aspx#LDAPConnectionLoadBalancingAnd

Ok, we have an answer: http://technet.microsoft.com/en-us/library/cc751317.aspx

This doc explains things in detail. DSProxy is what actually talks to the GCs when doing lookups, and it uses load balancing. (emphasis mine)

"Although NSPI is a very efficient process, the DSProxy process uses a load balancing mechanism to ensure that client requests are divided equally among all available global catalog servers.

When a MAPI client contacts NSPI Proxy, the IP address of the requesting client is hashed against the number of available global catalog servers. DSProxy uses the result to either proxy or refer the client to one of the global catalog servers. This load balancing method enables the client to contact the same global catalog server, thus ensuring consistency. The Directory Service Referral interface (RFRI) uses a different load balancing mechanism; when a client connects to RFRI, global catalog servers are returned in round robin fashion."

It appears that Outlook 2000 and later connect through RFR, which means they get their GC by round robin. The client refreshes its GC server at startup.
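
To make the difference between the two mechanisms in that quote concrete, here is a small sketch. It is only a model: the quoted doc does not say which hash DSProxy uses, so taking the client address's integer value modulo the GC count is my assumption, and the GC names are placeholders.

```python
import ipaddress
from itertools import cycle


def pick_gc_by_ip_hash(client_ip, gcs):
    """NSPI-proxy style: the same client IP maps to the same GC each time."""
    # Assumption: integer value of the address modulo the GC count stands in
    # for whatever hash DSProxy really applies.
    return gcs[int(ipaddress.ip_address(client_ip)) % len(gcs)]


class RoundRobinReferral:
    """RFR style: each new request gets the next GC in the list."""

    def __init__(self, gcs):
        self._gcs = cycle(gcs)

    def next_gc(self):
        return next(self._gcs)


gcs = ["gc-a", "gc-b", "gc-c"]  # placeholder names

# Hashing is sticky: asking twice for the same client gives the same answer.
first = pick_gc_by_ip_hash("10.0.0.5", gcs)
second = pick_gc_by_ip_hash("10.0.0.5", gcs)

# Round robin is not: four referrals walk the list and wrap around.
rfr = RoundRobinReferral(gcs)
referrals = [rfr.next_gc() for _ in range(4)]

print(first == second, referrals)
```

If this model is right, it would explain why closing and re-opening Outlook can land a client on a different GC under RFR, whereas a proxied NSPI client would keep getting the same one as long as the GC list stays the same.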


Summary: Ok, so that's good enough for me - I can test this next time - if a client is finding that emails to a certain group are failing, I can ask them to close Outlook, re-open, and try re-sending the mail. I will update when this comes up again (if it does...we're upgrading to 2010 shortly).


Update: Yup! 99.9% sure this was the issue. I spoke to the two people who were having the issue, and got one to send a test (it failed), and then close Outlook and test again. It worked this time! He got the other guy to test (failed), then close/re-open Outlook, and voila, worked. Cool!
