I just started using Veeam Backup, and it breaks the network on one of my host servers. I have been having this same issue intermittently for months now, but Veeam Backup triggers it reliably on this host.
This is not a case of the host's resources being overtaxed so that the guests lose network access intermittently during the backup.
The only things on this host are the VCSA VM and my cameras system VM.
Here's what happens: I run a Veeam backup of my camera system VM. As soon as Veeam starts transferring data to my NAS, the VCSA drops off the network, then the camera VM starts to lose connectivity to the cameras, and then it drops off the network too. I've let it sit for a few hours and the guests never regain network access. Rebooting the host brings the network back up.
I did notice a bunch of "VMXNET3 TX HANG" messages on my camera VM. After some googling I switched that VM's NIC to the e1000 vNIC, which had no effect. After some more googling I tried disabling TSO/LRO, first on the guests, then on the host, then on both. None of this had any effect.
I also tried uninstalling open-vm-tools and installing the VMware Tools provided by vSphere, no luck.
This doesn't seem to affect my other VMware host, just the one running my camera system (Ubuntu 16.04) and the VCSA VM (VMware Photon).
Output of the VMkernel log follows.
2019-11-26T06:25:43.536Z cpu22:2111250)WARNING: UserSocketInet: 2266: vsanmgmtd: waiters list not empty!
2019-11-26T06:25:43.537Z cpu22:2111250)WARNING: UserSocketInet: 2266: vsanmgmtd: waiters list not empty!
2019-11-26T06:25:44.609Z cpu10:2097865)DVFilter: 5963: Checking disconnected filters for timeouts
2019-11-26T06:35:44.597Z cpu7:2097865)DVFilter: 5963: Checking disconnected filters for timeouts
2019-11-26T06:45:44.586Z cpu0:2097865)DVFilter: 5963: Checking disconnected filters for timeouts
2019-11-26T06:50:44.655Z cpu11:2099882)VSCSI: 6602: handle 8205(vscsi0:0):Destroying Device for world 2099858 (pendCom 0)
2019-11-26T06:50:44.655Z cpu11:2099882)VSCSI: 6602: handle 8206(vscsi0:1):Destroying Device for world 2099858 (pendCom 0)
2019-11-26T06:50:44.656Z cpu11:2099882)cswitch: VSwitchDisablePTIOChainRemoveCB:1149: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]Remove ptDisable IOChain Handle for port 0x2000008
2019-11-26T06:50:44.656Z cpu11:2099882)NetPort: 1580: disabled port 0x2000008
2019-11-26T06:50:44.671Z cpu11:2099882)CBT: 723: Disconnecting the cbt device 70d63-cbt with filehandle 462179
2019-11-26T06:50:44.773Z cpu11:2099882)CBT: 723: Disconnecting the cbt device 70d62-cbt with filehandle 462178
2019-11-26T06:50:45.568Z cpu11:2099882)CBT: 1352: Created device 80d67-cbt for cbt driver with filehandle 527719
2019-11-26T06:50:45.569Z cpu11:2099882)VSCSI: 3782: handle 8207(vscsi0:0):Using sync mode due to sparse disks
2019-11-26T06:50:45.569Z cpu11:2099882)VSCSI: 3810: handle 8207(vscsi0:0):Creating Virtual Device for world 2099858 (FSS handle 920937) numBlocks=335544320 (bs=512)
2019-11-26T06:50:45.569Z cpu11:2099882)VSCSI: 273: handle 8207(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000
2019-11-26T06:50:45.572Z cpu11:2099882)FDS: 617: Enabling IO coalescing on driver 'deltadisks' device '2a0d6b-Ubuntu Zoneminder_1-000001-sesparse.vmdk'
2019-11-26T06:50:45.573Z cpu11:2099882)CBT: 1352: Created device 120d6d-cbt for cbt driver with filehandle 1183085
2019-11-26T06:50:45.573Z cpu11:2099882)VSCSI: 3810: handle 8208(vscsi0:1):Creating Virtual Device for world 2099858 (FSS handle 1314159) numBlocks=8589934592 (bs=512)
2019-11-26T06:50:45.573Z cpu11:2099882)VSCSI: 273: handle 8208(vscsi0:1):Input values: res=0 limit=-2 bw=-1 Shares=1000
2019-11-26T06:50:45.574Z cpu11:2099882)NetPort: 1359: enabled port 0x2000008 with mac 00:0c:29:36:01:fb
2019-11-26T06:50:45.574Z cpu11:2099882)cswitch: VSwitchDisablePTIOChainRemoveCB:1149: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]Remove ptDisable IOChain Handle for port 0x2000008
2019-11-26T06:55:44.574Z cpu7:2097865)DVFilter: 5963: Checking disconnected filters for timeouts
2019-11-26T07:00:44.937Z cpu20:2099643)Vmxnet3: 24930: VMware vCenter Server Appliance,00:0c:29:35:dd:b6, portID(33554439): Hang detected,numHangQ: 2, enableGen: 2
2019-11-26T07:00:44.937Z cpu20:2099643)Vmxnet3: 24939: portID:33554439, QID: 0, next2TX: 2842, next2Comp: 1472, lastNext2TX: 1474, next2Write:1893, ringSize: 4096 inFlight: 4, delay(ms): 2656,txStopped: 0
2019-11-26T07:00:44.937Z cpu20:2099643)Vmxnet3: 24943: portID: 33554439, sop: 1472 eop: 1472 enableGen: 0 qid: 2, pkt: 0x45a242feb680
2019-11-26T07:00:44.937Z cpu20:2099643)Vmxnet3: 24939: portID:33554439, QID: 1, next2TX: 3066, next2Comp: 3068, lastNext2TX: 3070, next2Write:3629, ringSize: 4096 inFlight: 21, delay(ms): 26237,txStopped: 0
2019-11-26T07:00:44.937Z cpu20:2099643)Vmxnet3: 24943: portID: 33554439, sop: 3068 eop: 3069 enableGen: 1 qid: 2, pkt: 0x45a25a276780
2019-11-26T07:00:44.938Z cpu20:2099643)NetSched: 717: 0x2000004: received a force quiesce for port 0x2000007, dropped 239 pkts
2019-11-26T07:00:44.945Z cpu20:2099643)cswitch: VSwitchDisablePTIOChainRemoveCB:1149: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]Remove ptDisable IOChain Handle for port 0x2000007
2019-11-26T07:00:44.946Z cpu20:2099643)NetPort: 1580: disabled port 0x2000007
2019-11-26T07:00:44.953Z cpu20:2099643)Vmxnet3: 18576: indLROPktToGuest: 0, vcd->umkShared->vrrsSelected: 1 port 0x2000007
2019-11-26T07:00:44.953Z cpu20:2099643)Vmxnet3: 18817: Using default queue delivery for vmxnet3 for port 0x2000007
2019-11-26T07:00:44.953Z cpu20:2099643)NetPort: 1359: enabled port 0x2000007 with mac 00:0c:29:35:dd:b6
2019-11-26T07:00:44.953Z cpu20:2099643)cswitch: VSwitchDisablePTIOChainRemoveCB:1149: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]Remove ptDisable IOChain Handle for port 0x2000007
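For anyone reading these logs later: the interesting lines are the per-queue "Vmxnet3: 24939" reports, where `next2Comp` lagging far behind `next2TX` and a large `delay(ms)` mean TX completions have stalled (queue 1 above had been stuck for roughly 26 seconds). A quick way to pull those fields out of the log, as a sketch based only on the line format shown above:

```python
import re

# Sample input: the two per-queue hang reports from the vmkernel log above.
log_lines = [
    "2019-11-26T07:00:44.937Z cpu20:2099643)Vmxnet3: 24939: portID:33554439, "
    "QID: 0, next2TX: 2842, next2Comp: 1472, lastNext2TX: 1474, next2Write:1893, "
    "ringSize: 4096 inFlight: 4, delay(ms): 2656,txStopped: 0",
    "2019-11-26T07:00:44.937Z cpu20:2099643)Vmxnet3: 24939: portID:33554439, "
    "QID: 1, next2TX: 3066, next2Comp: 3068, lastNext2TX: 3070, next2Write:3629, "
    "ringSize: 4096 inFlight: 21, delay(ms): 26237,txStopped: 0",
]

# Matches only the fields we care about; everything between next2Comp and
# inFlight is skipped non-greedily.
PATTERN = re.compile(
    r"QID: (?P<qid>\d+), next2TX: (?P<tx>\d+), next2Comp: (?P<comp>\d+)"
    r".*?inFlight: (?P<inflight>\d+), delay\(ms\): (?P<delay>\d+)"
)

def parse_hangs(lines):
    """Extract per-queue TX state from 'Vmxnet3: 24939' hang reports."""
    rows = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            rows.append({k: int(v) for k, v in m.groupdict().items()})
    return rows

for q in parse_hangs(log_lines):
    print(f"queue {q['qid']}: {q['inflight']} descriptors in flight, "
          f"oldest stuck for {q['delay'] / 1000:.1f} s")
```

Running this against a full vmkernel.log (e.g. after `grep 'Vmxnet3: 24939'`) makes it easy to see which queues hang and for how long, which is useful evidence when the problem turns out to be below the vNIC layer.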
Which version of ESXi are you using? Build number?
Can you also specify the version of VMware Tools on the virtual machine? Do I understand correctly that the vCenter Server loses connectivity when the camera server VM is being backed up?
ESXI Version: 6.7.0 Update 3 (Build 15018017)
VCSA Version: 6.7.0.41000 (Build: 14836122)
Camera server VMware Tools Version (Open-VM-Tools, installed from APT): version:10304
VCSA VMware Tools Version: version:10309
What is happening is that the guest VMs on this host (camera server and VCSA) are losing network access when Veeam begins transferring data to the NAS.
Yesterday I set up a test VM (Ubuntu 16.04) with the latest version of VMware Tools installed and an e1000 vNIC. This VM also loses network access when the backup starts transferring to the NAS.
Hi,
do you use NBD transport mode with Veeam?
I'm not sure; I only installed Veeam yesterday, so it's using whatever the defaults are. The backup target is an SMB share, so maybe.
I've been having this issue with the guests on this host losing network access intermittently for the past few months, so there is an underlying issue that Veeam is reliably triggering.
Is your Veeam Backup Server a VM? If yes, chances are high that you are using Virtual Appliance mode.
I think I'm using Virtual Appliance mode; the Veeam server is a VM on my other ESXi host, and that Veeam VM is also the backup proxy.
Does a Windows VM face the same issues?
I was able to back up my Windows PRTG VM on my other host without issues. I read that the Veeam VM should be excluded from backing up itself, so I haven't tried to back up that VM.
My vSphere license doesn't include vMotion, so I can't migrate the PRTG VM to the other host to see whether backing it up crashes the networking or not.
You can try switching to NBD transport mode. What does your vSwitch / VMkernel port setup look like?
Unfortunately I don't have a spare physical Windows machine to make NBD mode work; we are an all-Mac shop except for the servers/VMs, and all the Macs are currently in use.
I have been running into this issue on occasion for months, and I only installed Veeam yesterday. Veeam is just triggering whatever the underlying issue is.
Screenshots of the vSwitch and VMkernel NIC are attached.
The configuration of the EtherChannel on my switch follows:
interface Port-channel3
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
!
interface GigabitEthernet1/0/5
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
channel-group 3 mode on
!
interface GigabitEthernet1/0/6
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
channel-group 3 mode on
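One thing worth double-checking with this config: `channel-group 3 mode on` creates a static EtherChannel, and VMware requires the vSwitch teaming policy to be "Route based on IP hash" when the uplinks are bundled this way; any other policy can cause exactly this kind of connectivity flapping. The idea behind IP hash is that each source/destination IP pair deterministically maps to one uplink. As a rough illustration only (a simplified model, not ESXi's actual implementation):

```python
import ipaddress

def ip_hash_uplink(src_ip: str, dst_ip: str, num_uplinks: int) -> int:
    """Simplified model of IP-hash uplink selection: XOR the two
    addresses and take the result modulo the number of active uplinks.
    The real ESXi algorithm differs in detail, but the key property is
    the same: a given IP pair always lands on the same uplink."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % num_uplinks

# The same flow always maps to the same uplink, so per-flow ordering is
# preserved; different flows can spread across the two bundled NICs.
print(ip_hash_uplink("192.168.1.10", "192.168.1.50", 2))  # prints 0
print(ip_hash_uplink("192.168.1.10", "192.168.1.51", 2))  # prints 1
```

If the teaming policy on the vSwitch is not IP hash while the switch side runs a static port-channel, the switch can deliver frames on a link the host doesn't expect, which shows up as MAC flapping and dropped guest traffic.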
Please switch to NBD and test the backup.
Well, I think I'm going to have to move the VCSA to the other ESXi host and rebuild the host that's having network issues as a physical Ubuntu box for the camera system, unless someone knows what the log messages mean.
Anyone have an idea what's going on here?
I tried unplugging the cable from the second NIC and then running the backup. The host itself became inaccessible this time. I was able to get the network back up, for the host as well as the VMs, by disconnecting and reconnecting the cable.
Issue resolved.
I had to update the driver for the Intel 82576 NICs in my server from the igb version 5.0.5.1.1-5vmw driver that came bundled with ESXi to the igb version 5.2.5 driver supplied by Intel.
The updated driver can be downloaded from here: VMware vSphere 5: Private Cloud Computing, Server and Data Center Virtualization
Instructions for installing the driver are in the readme file that comes bundled with the driver.
Thank you for your update and the solution to your problem!