
Exchange migration

4:00am (11:30pm start)
Not going well so far.

Problem: The US exchange server refuses to route emails to the CAN exchange server.

Here's what we know:
- They are in the same domain.
- They can ping each other using IP and DNS.
- They can telnet to each other on port 25.
- They can send a message via SMTP commands across the VPN connection that arrives in a recipient's mailbox successfully, verified from both sides.
- Active Directory replication has been troublesome, but seems to work, for the most part.
- Tracert functions correctly - three hops: one local, one showing only *s (no reply), one destination - so the VPN is functioning correctly in that respect.
- A share can be mapped from one server to the other.
- Active Directory Domains and Trusts becomes unresponsive when opened - RPC issue
- dcpromo fails at replication (times out) - RPC issue
- SCCM 2007 installation completes with some failures - timeouts - RPC issue
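
The reachability checks in the list above can be scripted rather than run by hand. This is a minimal sketch, assuming Python is available on one of the servers. The port list (25 for SMTP, 135 for the RPC endpoint mapper, 389 for LDAP, 445 for SMB) is an illustrative assumption and was not recorded during the migration. RPC also moves to dynamically assigned high ports after the first conversation on 135, so a pass here does not rule out an RPC problem.

```python
import socket

# Hypothetical port list: SMTP, RPC endpoint mapper, LDAP, SMB.
PORTS = {25: "SMTP", 135: "RPC endpoint mapper", 389: "LDAP", 445: "SMB"}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def check_ports(host, ports=PORTS, timeout=3.0):
    """Map each port number to True/False reachability for the given host."""
    return {port: port_open(host, port, timeout) for port in ports}
```

Calling `check_ports("can-exch01.domain.local")` (a placeholder name, not a real server) from the US side, and the equivalent from the CAN side, gives a quick picture of which services answer in each direction.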

11:30am
We call Microsoft and are assured contact soon.

2:30pm
Unfortunately Microsoft has been slow to get back to us with an appropriate contact. The first guy didn't really have a clue, and then he said it would be an additional 1-2 hours for someone else. After 2.5 hours, I called them back and asked for someone local. No time for communication issues today.

3:00pm
Yeesh. Said that someone should be back to me in 20 minutes...that was 30 minutes ago. It's now 2:50pm...I'd kinda like to enjoy some of Canada Day.

8:16pm
Tech support guy has gone on lunch for 15 minutes.
They made some registry changes, after hours of "'replicate now' ... failed". That got kinda boring. They were also using this a lot: repadmin /Syncall /e /P

They tried this a bunch as well: netdom resetpwd /server:DC1 /userd:domain\admin /passwordd:*

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters
- new DWORD - DisableTaskOffload - Decimal - 1
- new DWORD - MaxUserPort - Decimal - 65535

HKLM\System\CurrentControlSet\Services\DNS\Parameters
- new DWORD - SocketPoolSize - Decimal - 2500

HKLM\System\CurrentControlSet\Control\Lsa\Kerberos\Parameters
- new DWORD - MaxTokenSize - Decimal - 65535
- new DWORD - MaxPacketSize - Decimal - 1 (forces Kerberos to use TCP instead of UDP)
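
For repeating these changes on several DCs, the values above can be turned into `reg add` command lines. This is a rough sketch and not what the support tech actually ran. It only prints the commands, so they can be reviewed before being pasted into an elevated prompt. One assumption to flag: the Kerberos key is written here under Control\Lsa, which is where Microsoft documents those values.

```python
# Registry values from the notes above: (key, value name, decimal DWORD data).
# Assumption: the Kerberos key is placed under Control\Lsa, where Microsoft
# documents MaxTokenSize and MaxPacketSize.
CHANGES = [
    (r"HKLM\System\CurrentControlSet\Services\TCPIP\Parameters", "DisableTaskOffload", 1),
    (r"HKLM\System\CurrentControlSet\Services\TCPIP\Parameters", "MaxUserPort", 65535),
    (r"HKLM\System\CurrentControlSet\Services\DNS\Parameters", "SocketPoolSize", 2500),
    (r"HKLM\System\CurrentControlSet\Control\Lsa\Kerberos\Parameters", "MaxTokenSize", 65535),
    (r"HKLM\System\CurrentControlSet\Control\Lsa\Kerberos\Parameters", "MaxPacketSize", 1),
]


def reg_add_command(key, name, data):
    """Build a 'reg add' command line that sets one REG_DWORD value."""
    return f'reg add "{key}" /v {name} /t REG_DWORD /d {data} /f'


def build_commands(changes=CHANGES):
    """Return the reg add command lines for every change, in order."""
    return [reg_add_command(key, name, data) for key, name, data in changes]


# Print the commands for review; nothing is written to the registry here.
for line in build_commands():
    print(line)
```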

They then had me restart each DC sequentially, allowing full boot-up before proceeding.

I launched dssite.msc on each DC, and things were looking better, but for the US DCs, the CAN DC1 (PDC) NTDS settings are blank. Not sure why.

12:00am
I finally ask why this is taking so long with no real results, and speak to a 2nd level technician, who is in fact a 1st level technician, and just scripts the front-line people to do his bidding. Essentially this is a known flaw with Server 2008, and they are working on a patch. He tells me that if we set up two 2003 DCs, one on each site, then the issue will be resolved. I tell him to call me back in 15 minutes.

12:40am
Still no word. Calling the tech line. Again.
Guy was very apologetic, but he had no control over where the tech support came from. Said it was all coming from India at this point. Lovely. He said he would escalate it as my case was priority one at the moment.

12:55am
No word yet. In the meantime, I've just started working on it again on my own, but it looks like replication is failing anyways. Right off the bat I got an RPC error, so I changed the DNS servers to the YYZ-DCs, and I managed to get to the replication part of the dcpromo for the JFK 2003 DC. It is, however, taking a long time at this point, so I'm pretty sure it's timing out. That normally takes about 5 minutes...so...we wait.

5:35am
So tired.

This guy I just spoke to actually knows what he's doing, and is methodical and takes notes.

He'll be emailing me the results of his analysis in a while, so I can at least get a few hours sleep before the day starts. He has come to the conclusion that it could be:
- Packet loss causing network timeouts, specifically RPC packets being lost
- Symantec Endpoint
- bad switchports
- corrupt vNIC; he said we could try adding a second one, giving it a new IP, disabling the old one, and updating any relevant DNS records.
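
One way to get a rough number for the first suspect is to hammer a port across the VPN and count failed attempts. This is a sketch under assumptions: the host name, port 135, and the attempt count are placeholders, and TCP retransmits hide some loss, so treat the result as a coarse indicator only.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return the connect time in seconds, or None if the attempt failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def failure_rate(probe, attempts=50, pause=0.0):
    """Call probe() repeatedly; return (failed fraction, successful timings)."""
    timings = []
    failures = 0
    for _ in range(attempts):
        result = probe()
        if result is None:
            failures += 1
        else:
            timings.append(result)
        time.sleep(pause)
    return failures / attempts, timings
```

For example, `failure_rate(lambda: tcp_probe("dc1.domain.local", 135), attempts=100, pause=0.2)` run from each side would show whether connects fail or slow down in one direction only, which would help narrow it down before swapping switchports or vNICs.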


Argh. We'll see.

