A real doozie: Changing VM port group drops networking for 10 minutes

Update:  Support had nothing, so we are forced to chalk it up to the VDS stuff being wonky.  If anyone runs across this, it would be interesting to hear their solution.

What follows is the ticket we've submitted (with a few additions that might help the general public).  This being the first time I've really had the chance to play around with and design my own VDS solution, mistakes were made.  Long story short, I wanted to do X initially, but the switches didn't like that and our network guy recommended against it, so we did Y.  Turns out X does work, but we get weird, unsanctioned behaviour.  (X being 'let vSphere handle networking, just give me VLAN trunk switchports', Y being 'let's use LAGs/LACP/MLAG/etc'.)

I will update once we have a solution...

March 12 - Quick update: I moved the rest of the VMs off the old VDS, and roughly 25% of them had the same failure.  We're still waiting for something concrete from Extreme.  I suspect that with our failure scenario now 'non-reproducible', we're out of luck.

-----------------------
We are running into an issue while changing VMware port groups between virtual distributed switches.  Some VMs work, some VMs go into a state where they can communicate with other VMs in that new port group, but cannot ping the default gateway.

Environment overview:

  • vSphere 5.5, VDSes are 5.5 (not upgraded)
  • Switches are Extreme Networks
  • Old port group is the same VLAN, single uplink, different port group name, different virtual distributed switch.
  • New port group is the same VLAN, single uplink, different port group name, different virtual distributed switch.
  • Old VDS is using “route based on originating virtual port”
  • New VDS is using “route based on physical NIC load”
  • VMs are running successfully on the new port group/VDS/uplink
  • On the switch, MLAG is turned off and the switchports are configured as VLAN trunks


Process:

  1. We change the network label (port group) from ‘Old PG’ to ‘New PG’
  2. Press ‘OK’
  3. Symptoms begin


Symptoms:

  1. Happens on multiple hosts (all with known good VMs in respective port groups)
  2. VM cannot ping its Default Gateway
  3. Clients on other switches cannot ping the VM
  4. VM can ping other VMs on the new port group (same VLAN)
  5. VM cannot ping other VMs on the old port group (same VLAN)
  6. When we were able to reproduce the issue, it took ~10 minutes before connectivity was restored with no action on our part.
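The ~10 minute recovery in symptom 6 is easier to hand to a vendor with timestamps attached. Below is a minimal Python sketch for measuring it, assuming a Linux guest (the `ping -c`/`-W` flags differ on Windows). The helper names `ping_once`, `monitor` and `outage_windows` are made up for illustration and are not part of any VMware or Extreme tooling.

```python
import subprocess
import time
from typing import Iterable, List, Tuple


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if a single ICMP echo to host succeeds.

    Assumes Linux iputils ping: -c is the packet count, -W the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host: str, duration_s: float, interval_s: float = 1.0) -> List[Tuple[float, bool]]:
    """Ping host roughly every interval_s seconds and record (timestamp, reachable)."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append((time.time(), ping_once(host)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse samples into (start, end) outage windows.

    A window opens at the first failed sample and closes at the next
    successful one. An outage still open when the data ends is not reported.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts            # first failure: outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # first success after failures: outage ends
            start = None
    return windows
```

Running `outage_windows(monitor(gateway_ip, duration_s=900))` on the affected VM just before changing the port group (with `gateway_ip` set to the VM's default gateway) would give start and end timestamps for each loss of gateway connectivity, which can then be lined up against the physical switch's logs.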


What we tried to restore connectivity:

  1. Move back to old PG (this worked some of the time)
  2. Disconnect NIC->press OK->reconnect NIC->press OK via vSphere Client  (did not work)
  3. Disconnect->press OK->change port group back to old->press OK->reconnect->press OK  (did not work)
  4. Reboot the guest OS  (did not work)
  5. On the guest OS: arp -d  (did not work)
  6. On the guest OS: disable/enable the network adapter (Windows and Linux)


Research indicates this is an issue on the physical switch, but we cannot find any (obvious) errors/configuration issues.


Some research links...

  • https://communities.vmware.com/thread/420029
  • https://communities.vmware.com/thread/462638

