VMware Cloud Community
MightyGorilla
Contributor

Suggestions for troubleshooting intermittent network accessibility to VM

ESXi 6.7 running on an HP DL380 Gen9 with only 2 VMs:

  • 1 guest OS is Win2003R2 x64 (built using VMware Converter)
  • The other is Win7pro x64

CPU- and memory-wise, the server looks idle, but at random times clients get short timeouts (5-20 seconds) when connecting to the internal web application running on the 2003 server.

This was not happening with the original server prior to P2V conversion. Haven't noticed any "freezes" when on the server desktop or found any interesting messages in the Win event log.

Really just looking for suggestions of where to look for hints.

Things I have tried:

  • Updated all firmware & BIOS of host
  • BIOS: Switched from balanced to high performance power management
  • BIOS: Verified that virtualization settings are enabled
  • BIOS: Disabled all C-States
  • Disabled unused hardware
  • Removed the E1000 virtual NIC and added a VMXNET3 one
  • Dedicated 1 of the physical NICs to the win2003 server
  • Removed drivers for HP-specific hardware from the VM after conversion
  • Tried setting Network Latency Sensitivity to HIGH with no changes
  • ESXi virtual switch interfaces were all changed from "1000, Full" to "Automatic" to match physical switch ports

Rebooting the VM *seems* to improve the situation for a short while (but since it's intermittent, it's hard to tell if that's truly the case...)

Thanks for any ideas :)

11 Replies
daphnissov
Immortal

So you're trying to nurse a decade-plus-old VM back to health :) This might be a good moment to get off that antiquated stuff.

Is VMware Tools installed inside this VM? If so, what version? What is the hardware version of this ancient beast?

MightyGorilla
Contributor

Well, true, I am trying to nurse a decade-old system, but it wasn't a VM until recently... :)

The hardware version is 13. Tools are v10.0.12 b4448496

I would really like to chuck the old system in the dumpster, but it's running an old, unsupported ERP system that we have little hope of being able to fresh-install on a later version of Windows Server. Not a situation anyone wants to be in, but these things happen, and I guess that's when VMware Converter at least tries to come to the rescue. The long-term plan is to ditch it, but it could take years just to vet a replacement.

daphnissov
Immortal

You may want to try and downgrade the virtual hardware to something no higher than about 10.

MightyGorilla
Contributor

Ok - I appreciate the idea. I think the initial hw version I tried was 12, and I upgraded it to 13 afterward.

I'll try using the converter to build a hw v10-or-less vm as soon as I get an opportunity.

Thanks!

MightyGorilla
Contributor

So I tried converting the VM to virtual hardware v8 and unfortunately the same issue occurred. (I kept it standard, and didn't mimic all of the tweaks for performance I had tried on the previous VM.)

After a long and painful week of trial and error, I did figure out how to reinstall all of the components, so this machine could be rebuilt from scratch instead of being P2V-converted from the old server. I still had to start from a fresh Win2003 VM, though, and ultimately the same issues occurred on the new one.

I'm currently working to finagle the components into installing on a fresh Win2012 VM, but it's rough going, since the applications are nearly a decade old.

The thing I find so strange about this issue is that it comes and goes. The first day we used the VM, users said it performed better than they had ever seen. The next day, they said it was borderline unusable. The following day, it was performing well again.

The server never seems to be under duress regardless of whether it's performing well or poorly at the time.

The applications are all running in IIS, and the "network accessibility" issue usually manifests itself as excessive delays between page loads on the clients, but also sometimes results in a full timeout and error in the browser. These don't happen on the old physical server.

Lalegre
Virtuoso

Do you see the same issue with the other virtual machines on the host?

MightyGorilla
Contributor

Well, the host doesn't have much else running on it. It has one Win7 VM, and an Ubuntu DNS server.

We haven't noticed any issues with either of them, but they're of course running different services...

We don't have the resources to build a full lab environment for testing, so I've just been firing up the VM in production and listening for user feedback. :)

I don't normally use IIS. I'll try looking for some performance data from IIS to see if it might provide some hints.

I really don't know if the VM's performance/connectivity issues are on the network side, sending replies and receiving requests, or if they are on the application side.
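
One way to start telling the two apart, as a rough sketch only: from a client machine, time the TCP connect separately from the wait for the first response byte. A slow connect would point more toward the network path or the guest's TCP stack, while a fast connect followed by a slow first byte would point more toward IIS or the application. The function names, the 1-second threshold, and the use of plain HTTP on port 80 are all assumptions made for illustration; nothing here comes from the actual environment.

```python
import socket
import time


def probe(host, port=80, path="/", timeout=30.0):
    """Return (connect_seconds, first_byte_seconds) for one plain HTTP GET.

    Pass an IP address as host to keep DNS lookup time out of the numbers.
    """
    start = time.perf_counter()
    # Phase 1: TCP handshake only.
    sock = socket.create_connection((host, port), timeout=timeout)
    connected = time.perf_counter()
    try:
        request = (
            f"GET {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        # Phase 2: block until the server sends the first response byte.
        sock.recv(1)
        first_byte = time.perf_counter()
    finally:
        sock.close()
    return connected - start, first_byte - connected


def classify(connect_s, first_byte_s, threshold_s=1.0):
    """Label a sample by which phase exceeded the (arbitrary) threshold."""
    if connect_s >= threshold_s:
        return "network"
    if first_byte_s >= threshold_s:
        return "application"
    return "ok"
```

Running `probe()` in a loop every few seconds and logging the timestamp together with `classify(*result)` would give a record to compare against user complaints. A timeout raises an exception in either phase, so a real loop would need to catch `OSError` and note which phase failed.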

MightyGorilla
Contributor

Since last I posted, I've managed to rebuild the entire guest system beginning with a fresh install of Server 2012, and ultimately got the same results.

In researching my problem, I've also seen a large number of past & present issues with Broadcom NICs. This HP server has 4 BCM5719 embedded NICs, and 2 Intel 10Gb NICs.

I've just realized that the vmkernel.log is full of thousands (10 lines or so every 30 seconds) of error messages that in some way relate to the Broadcom NICs.

"2018-09-06T07:11:37.587Z cpu13:2099148)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2099148/uw.2099148"

"2018-09-06T07:11:37.587Z cpu13:2099148)MemSchedAdmit: 477: uw.2099148 (9114) extraMin/extraFromParent: 117/117, nicmgmtd (806) childEmin/eMinLimit: 2478/2560"

I couldn't really find any information about these errors online at all.
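
To check whether bursts of these messages line up with the times users report slowness, the log could be bucketed per minute. This is a minimal sketch that assumes every relevant line looks like the first sample quoted above; the regular expression is inferred from those two lines only, not from any VMware documentation, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Timestamp (kept to the minute), then the "Admission failure in path:" text.
# Format inferred from the two sample lines in this post.
LINE_RE = re.compile(
    r"(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+Z\s+"
    r"cpu\d+:\d+\)MemSchedAdmit: \d+: Admission failure in path: (?P<path>\S+)"
)


def admission_failures_per_minute(lines):
    """Count 'Admission failure' messages per (minute, first path component)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            # "nicmgmtd/nicmgmtd.2099148/uw.2099148" -> "nicmgmtd"
            group = match.group("path").split("/")[0]
            counts[(match.group("minute"), group)] += 1
    return counts
```

Feeding it the lines of a copy of vmkernel.log and printing `counts.most_common(20)` would show the busiest minutes, which can then be compared with the times users noticed slow page loads.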

Moving all the traffic to the Intel NICs and disconnecting the Broadcom NICs halted the errors. I'll have to wait a few days collecting user feedback to see if this has had any impact.

Examining the Broadcom NICs' firmware (1Gb 4-port 331i Adapter (22BE)), I found its Boot Code to be the latest version, but the NCSI version is 1.4.18.0 and the latest from HP shows 1.4.22.0. However, I can't get HP's installer to execute on ESXi 6.7 for some reason, so I guess I'll have to ask HP about that.

I doubt any of this is very useful to others, but documenting my experience all the same...

MightyGorilla
Contributor

Disconnecting the Broadcom NICs halted the errors temporarily, but they ended up returning many hours later.

After disabling the embedded Broadcom quad-port NIC last Saturday, the "admission failure" messages all stopped that day and have not returned, for what that's worth.

I don't know if that helped anything beyond getting rid of log bloat yet...

jameswalkervmw
VMware Employee

Hello,

I found a similar case with "admission failure" messages reported. Can you try disabling netqueue on the card?

esxcli network nic queue loadbalancer set --rsslb=off -n vmnicX

Thanks,

James

James Walker VMware Support Moderator
BorgSquirrel
Contributor

I found this when submitting a case to VMware today; it seems spot on for the admission failure problem.

VMware Knowledge Base
