More to report...

While troubleshooting, I've had the RDP connection drop out on me a few times while the VM console stayed relatively active. The last time it happened I tried pinging the server and discovered it was unreachable; the requests timed out. I'm not sure how to get a timestamped ping log without scripting, so for now we're going to try another vmnic on a different adapter, but on the same network. We'll give it the same IP and disable the old one.

Ok, that's done - new vmnic on a different adapter, but same IP setup. I'm running another ping log. The last one had 5 instances of 10-20 "Request timed out" replies over three hours. I should know more in an hour or two.

Well, it's still dropping the connection. So we need to narrow the problem down to either VMware or the guest OS. I now have ping logs going to all the guests on that ESX server. If I see timeouts on all the VMs, then we know it's VMware, and if the other VMs see no timeouts, then we know it's SQL4 being silly.
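For what it's worth, that decision rule can be sketched in a few lines of Python. This is only an illustration: the function name, the result labels and the guest names other than SQL4 are made up, and the timeout counts would have to be tallied from the ping logs by hand or by another script.

```python
def locate_fault(timeouts_by_guest, suspect):
    """Guess where the fault lies from per-guest timeout counts.

    timeouts_by_guest: dict of guest name -> number of timeouts seen.
    suspect: the guest that showed the problem first (SQL4 here).
    """
    others = {g: n for g, n in timeouts_by_guest.items() if g != suspect}
    if not others:
        # Nothing to compare against, so no conclusion can be drawn.
        return "inconclusive"
    if timeouts_by_guest.get(suspect, 0) == 0:
        return "no timeouts on suspect"
    if all(n > 0 for n in others.values()):
        # Every guest on the host drops pings: points at the host/VMware side.
        return "host"
    if all(n == 0 for n in others.values()):
        # Only the suspect drops pings: points at that guest OS.
        return "guest"
    return "inconclusive"


# Hypothetical counts; guest2..guest4 are placeholder names.
print(locate_fault({"SQL4": 5, "guest2": 3, "guest3": 1, "guest4": 2}, "SQL4"))
```

A mixed result (some of the other guests dropping pings, some not) comes back as "inconclusive", which is probably the honest answer in that case.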

More to come later...

Well, weird answer. All the guests on that host are getting timeouts, but the host itself doesn't get timeouts. I pinged a few other guests and hosts on other machines to verify this, and ONLY the guests on that host are having issues. Very strange.

Another oddity was an error message saying the SQL4 server was getting a DoS attack from our integration server. I find it strange that an event like that would not cause physical problems, but I guess in a virtual environment things behave a bit differently.

So, Dan is coming over and we're going to reinstall ESX - shouldn't take too long - and then get the four VMs on that ESX host back up and running, and I'll set up the ping logs once again.

I found this on TechNet: http://www.microsoft.com/technet/scriptcenter/resources/qanda/aug06/hey0817.mspx

I'm using it for the ping logs now. It's clunky, but it will do until I can figure out how to ping multiple targets and write each one to a separate log file.
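One way to ping several targets into separate, timestamped log files might look like the Python sketch below. It is not the TechNet script, just a rough alternative. The function names, the "pinglogs" folder and the log line format are my own assumptions, and it shells out to the system ping command, assuming Windows-style flags on Windows and Linux-style flags elsewhere.

```python
import datetime
import pathlib
import platform
import subprocess
import time


def ping_once(target, timeout_s=1):
    """Send one echo request; return True if a reply came back."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), target]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), target]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Windows ping can exit 0 on "Destination host unreachable",
    # so also require a TTL field in the output.
    return result.returncode == 0 and "ttl=" in result.stdout.lower()


def log_ping_round(targets, log_dir, ping=ping_once, now=None):
    """Ping each target once and append a timestamped line to <target>.log."""
    log_dir = pathlib.Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    results = {}
    for target in targets:
        ok = ping(target)
        results[target] = ok
        status = "reply" if ok else "TIMEOUT"
        with open(log_dir / f"{target}.log", "a") as fh:
            fh.write(f"{stamp} {target} {status}\n")
    return results


def run_ping_logs(targets, log_dir="pinglogs", interval_s=1.0, rounds=None):
    """Repeat ping rounds until `rounds` is reached (forever if None)."""
    done = 0
    while rounds is None or done < rounds:
        log_ping_round(targets, log_dir)
        done += 1
        time.sleep(interval_s)
```

Calling something like run_ping_logs(["sql4", "guest2"]) (guest2 being a placeholder name) would leave one log file per target in the pinglogs folder, with one line per ping. The targets are pinged one after another, so a slow or dead target delays the rest of that round by up to the timeout.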

UPDATE!!!!

The issue completely disappeared, which is great! Except that I don't know WHY it disappeared. Very, very odd. No more timeouts on any of the afflicted servers. Our SQL box is back to 100%. MESSED UP, MAN. Messed.
