VMware Cloud Community
itburnout
Contributor

Random Packet Loss When Using IP Hash Nic Teaming

Hello all,

I recently upgraded from ESX 3.5 U2 to ESX 4. I have 2 hosts in an HA cluster running approximately 20 VMs. I attempted to configure "Route based on IP hash" and set up the corresponding port-channel on our Cisco Catalyst 3750. I have attached the relevant switch config for reference.

All seemed to be OK, as I could ping everything as soon as the port-channel stabilized with the NICs, but then I started to randomly drop packets. This was wreaking havoc on our network because our domain controllers and DNS servers are virtualized. There seemed to be no pattern: guests would just drop off the map for up to 30 seconds and then come back as if nothing had happened. During these outages I believe I was still able to ping the host, and I am trying to reproduce the issue this morning. I have since broken the port-channel on the switch for one of my hosts and vMotioned all of the VMs from the other host to it.

I would like to add that when I break the port-channel and set the NIC team to route based on the originating virtual port ID, it works fine. Something is amiss with the port channel, but neither Cisco nor I can figure it out.

Has anyone seen this behavior? The switch config is by the book, and I see no obvious errors on the switchport and channel-group interfaces.

Thanks for your help!

2 Replies
admin
Immortal

Hi itburnout,

This looks like an obvious configuration error. In my experience, no config other than "channel-group xxxx" is needed on an interface when configuring a port channel. Could you try removing the unnecessary "switchport access vlan 63" and "switchport mode access" lines from each interface?

Another concern: which "port-channel load-balance" method is configured on your switch? It should be src-dst-ip to work with the IP hash policy.
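To illustrate why the two ends should hash on the same fields, here is a rough Python sketch of how an IP-hash policy picks an uplink. It is only a simplified model (XOR the source and destination IPv4 addresses, then take the result modulo the number of uplinks), not the actual ESX or Cisco implementation. The function name and the sample addresses are made up for the example.

```python
import ipaddress


def ip_hash_uplink(src_ip: str, dst_ip: str, n_uplinks: int) -> int:
    """Return the index of the uplink chosen for a source/destination pair.

    Simplified model (an assumption, not vendor code): XOR the two IPv4
    addresses as 32-bit integers, then take the result modulo the uplink count.
    """
    if n_uplinks < 1:
        raise ValueError("n_uplinks must be at least 1")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % n_uplinks


if __name__ == "__main__":
    # Hypothetical guest at 10.0.0.1 talking to three peers over a 2-NIC team.
    for peer in ("10.0.0.2", "10.0.0.3", "10.0.0.4"):
        print(peer, "-> uplink", ip_hash_uplink("10.0.0.1", peer, 2))
```

In this model a given source/destination pair always lands on the same uplink, and different peers can land on different uplinks, so the same guest MAC may appear on more than one physical port. That is why the switch side has to treat the ports as a single channel.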

itburnout
Contributor

Hello Chris,

It turns out that even when I broke the port channel and eliminated all NIC teaming, the guests would intermittently drop packets. I contacted support, and the tech said that another support ticket had been opened earlier that day with the exact same issue and topology as mine.

It turns out that the software iSCSI initiator was to blame. I had added, changed, and deleted LUNs on our iSCSI SAN and hadn't rescanned the datastores on the cluster. Whenever the host polled for the now non-existent volume, the whole ESX host would freak out. He said this was likely a bug in ESX 4. Since I performed a rescan I haven't had any problems.

I have since brought my channel-group back online and everything is working like a champ.

Thanks!
