Exchange migration

4:00am (11:30pm start)
Not going well so far.

Problem: The US exchange server refuses to route emails to the CAN exchange server.

Here's what we know:
- They are in the same domain.
- They can ping each other using IP and DNS.
- They can telnet to each other on port 25.
- They can send a message via SMTP commands across the VPN connection that arrives in a recipient's mailbox successfully, verified from both sides.
- Active Directory replication has been troublesome, but seems to work for the most part.
- Tracert functions correctly - three hops, one local, one *s, one destination - so the VPN is functioning correctly in that respect.
- A share can be mapped from one server to the other.
- Active Directory Domains and Trusts becomes unresponsive when opened - RPC issue
- dcpromo fails at replication (times out) - RPC issue
- SCCM 2007 installation completes with some failures - timeouts - RPC issue
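
Since ping and port 25 work but everything RPC-based falls over, a quick TCP check of the other ports involved is worth having on hand. This is only a rough sketch: the function name is made up, and the port list in the usage note is my assumption (135 is the RPC endpoint mapper, 389 LDAP, 445 SMB, 25 SMTP). It says nothing about the dynamic RPC ports that replication actually uses after the endpoint mapper, so a clean result here does not rule out an RPC problem.

```python
import socket


def check_ports(host, ports, timeout=3.0):
    """Return {port: True/False} for whether a TCP connect to each port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and completes the TCP handshake
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # refused, timed out, unreachable, or the name did not resolve
            results[port] = False
    return results
```

Something like check_ports("can-dc1.example.local", [25, 135, 389, 445]) run from a US server (the hostname is a placeholder) would show at a glance which of those ports answer across the VPN.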

11:30am
We call Microsoft and are assured contact soon.

2:30pm
Unfortunately Microsoft has been slow to get back to us with an appropriate contact. The first guy didn't really have a clue, and then he said it would be an additional 1-2 hours for someone else. After 2.5 hours, I called them back and asked for someone local. No time for communication issues today.

3:00pm
Yeesh. Said that someone should be back to me in 20 minutes...that was 30 minutes ago. It's now 2:50pm...I'd kinda like to enjoy some of Canada Day.

8:16pm
Tech support guy has gone on lunch for 15 minutes.
They made some registry changes, after hours of "'replicate now' ... failed". That got kinda boring. They were also using this a lot: repadmin /Syncall /e /P

They tried this a bunch as well: netdom resetpwd /server:DC1 /userd:domain\admin /passwordd:*

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters
- new DWORD - DisableTaskOffload - Decimal - 1
- new DWORD - MaxUserPort - Decimal - 65535

HKLM\System\CurrentControlSet\Services\DNS\Parameters
- new DWORD - SocketPoolSize - Decimal - 2500

HKLM\System\CurrentControlSet\Control\Lsa\Kerberos\Parameters
- new DWORD - MaxTokenSize - Decimal - 65535
- new DWORD - MaxPacketSize - Decimal - 1
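
With several DCs to touch, typing these in by hand gets old. Here is a rough sketch that renders the values listed above as .reg file text so they can be reviewed and imported instead. The function and variable names are mine, not anything support gave me, and the Kerberos key is written under Control\Lsa, which is where Microsoft documents MaxTokenSize and MaxPacketSize. Treat it as an illustration and check the output before importing it anywhere.

```python
# Values as listed in the notes above; key paths use the full root name
# because .reg files do not accept the HKLM abbreviation.
SETTINGS = {
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters": {
        "DisableTaskOffload": 1,
        "MaxUserPort": 65535,
    },
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters": {
        "SocketPoolSize": 2500,
    },
    # Assumption: Control\Lsa is the documented location for these two values
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters": {
        "MaxTokenSize": 65535,
        "MaxPacketSize": 1,
    },
}


def render_reg(settings):
    """Render {key: {name: int}} as .reg text with 8-digit hex DWORD values."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, values in settings.items():
        lines.append(f"[{key}]")
        for name, value in values.items():
            # .reg files store DWORDs as zero-padded hexadecimal
            lines.append(f'"{name}"=dword:{value:08x}')
        lines.append("")
    return "\n".join(lines)
```

Writing render_reg(SETTINGS) to a file and opening it in a text editor first makes it easy to confirm that decimal 65535 became 0000ffff and 2500 became 000009c4.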

They then had me restart each DC sequentially, allowing full boot-up before proceeding.

I launched dssite.msc on each DC, and things were looking better, but for the US DCs, the CAN DC1 (PDC) NTDS Settings are blank. Not sure why.

12:00am
I finally ask why this is taking so long with no real results, and speak to a 2nd level technician, who is in fact a 1st level technician, and just scripts the front-line people to do his bidding. Essentially this is a known flaw with Server 2008, and they are working on a patch. He tells me that if we set up two 2003 DCs, one on each site, then the issue will be resolved. I tell him to call me back in 15 minutes.

12:40am
Still no word. Calling the tech line. Again.
Guy was very apologetic, but he had no control over where the tech support came from. Said it was all coming from India at this point. Lovely. He said he would escalate it as my case was priority one at the moment.

12:55am
No word yet. In the meantime, I've just started working on it again on my own, but it looks like replication is failing anyway. Right off the bat I got an RPC error, so I changed the DNS servers to the YYZ-DCs, and I managed to get to the replication part of the dcpromo for the JFK 2003 DC. It is, however, taking a long time at this point, so I'm pretty sure it's timing out. That normally takes about 5 minutes...so...we wait.

5:35am
So tired.

This guy I just spoke to actually knows what he's doing, and is methodical and takes notes.

He'll be emailing me the results of his analysis in a while, so I can at least get a few hours sleep before the day starts. He has come to the conclusion that it could be:
- Packet loss causing network timeouts, specifically RPC packets being lost
- Symantec Endpoint
- bad switchports
- corrupt vNIC; he said we could try adding a second one, giving it a new IP, disabling the old one, and updating any relevant DNS records.
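
To get some numbers behind the packet loss theory, repeated short-timeout TCP connects are one crude option. This is only a sketch under my own assumptions: the function name and defaults are invented, and a failed connect is a stand-in for loss rather than a true measurement, since a timeout shorter than the OS's first SYN retransmit just means one dropped SYN shows up as one failure.

```python
import socket
import time


def probe(host, port, attempts=20, timeout=0.5, pause=0.1):
    """Connect repeatedly; return (failure_rate, list of successful connect times in seconds)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    failures = 0
    latencies = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies.append(time.monotonic() - start)
        except OSError:
            # refused, unreachable, or no handshake within the timeout
            failures += 1
        time.sleep(pause)
    return failures / attempts, latencies
```

Running it against port 135 on the far-side DC from both sites, and again from a machine on a different switchport, would at least hint at whether the loss follows the VPN, the port, or the vNIC.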


Argh. We'll see.
