ESXi 6.7 running on an HP DL380 Gen9 with only 2 VMs:
CPU- and memory-wise, the server looks idle, but clients randomly get short-duration timeouts (5-20 seconds) connecting to the internal web application running on the Windows Server 2003 VM.
This was not happening with the original server prior to P2V conversion. Haven't noticed any "freezes" when on the server desktop or found any interesting messages in the Win event log.
Really just looking for suggestions of where to look for hints.
Things I have tried:
Rebooting the VM *seems* to improve the situation for a short while (but since it's intermittent, it's hard to tell if that's truly the case...)
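Since the problem is intermittent, one thing that might help is a client-side probe that timestamps the outages, so good and bad periods can be correlated with host logs later. A minimal sketch, assuming a placeholder URL, a 3-second "slow" threshold, and a 5-second poll interval (all made-up values, adjust to taste):

```shell
URL="http://appserver/"   # hypothetical internal app URL

# classify one curl total-time measurement as OK / SLOW / TIMEOUT
classify() {
  awk -v t="$1" 'BEGIN {
    r = (t == "timeout") ? "TIMEOUT" : (t > 3 ? "SLOW" : "OK")
    print r
  }'
}

# Probe loop (run from an affected client; Ctrl-C to stop):
#   while true; do
#     t=$(curl -o /dev/null -s --max-time 30 -w '%{time_total}' "$URL") || t=timeout
#     echo "$(date) $(classify "$t") $t" >> probe.log
#     sleep 5
#   done

# The classifier itself, exercised with sample timings:
classify 0.18     # OK
classify 12.4     # SLOW
classify timeout  # TIMEOUT
```

A log like this makes it possible to say "the app was unreachable from 10:02 to 10:04" instead of relying on delayed user reports.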
Thanks for any ideas
So you're trying to nurse a decade-plus-old VM back to health? This might be a good moment to get off that antiquated stuff.
Is VMware tools installed inside this VM? If so, what version? What is the hardware version of this ancient beast?
Well, true- I am trying to nurse a decade-old system, but it wasn't a VM until recently...
The hardware version is 13. Tools are v10.0.12 b4448496
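For anyone else checking the same things: both values can be confirmed from the ESXi shell with `vim-cmd`, which ships with ESXi. The VM name and sample output line below are fabricated for illustration; only the commands themselves are real.

```shell
# On the ESXi host:
#   vim-cmd vmsvc/getallvms          # lists Vmid, name, vmx path, guest OS, hw version
#   vim-cmd vmsvc/get.guest <Vmid>   # includes toolsStatus and toolsVersion

# Extracting the hardware version from a sample getallvms line:
sample='12   erp2003   [datastore1] erp2003/erp2003.vmx   winNetStandardGuest   vmx-13'
echo "$sample" | grep -o 'vmx-[0-9]*'   # vmx-13
```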
I would really like to chuck the old system in the dumpster, but it's running an old, unsupported ERP system that we have little hope of fresh-installing on a later version of Windows Server. Not a situation anyone wants to be in, but these things happen, and that's when VMware Converter at least tries to come to the rescue. The long-term plan is to ditch it, but that could take years just to vet a replacement.
Ok - I appreciate the idea. I think the initial hw version I tried was 12, and I upgraded it to 13 afterward.
I'll try using the converter to build a hw v10-or-less vm as soon as I get an opportunity.
So I tried converting the VM to virtual hardware v8 and unfortunately the same issue occurred. (I kept it standard, and didn't mimic all of the tweaks for performance I had tried on the previous VM.)
After a long and painful week of trial and error, I did manage to figure out how to reinstall all of the components, so the machine could be rebuilt from scratch instead of P2V-converting the old server. I still needed to start from a fresh Win2003 VM, though, and ultimately the same issues occurred with the new one.
I'm currently working to finagle the components into installing on a fresh Win2012 VM, but it's rough going, since the applications are nearly a decade old.
The thing I find so strange about this issue is that it comes and goes. The first day we used the VM, users said it performed better than they had ever seen. The next day, they said it was borderline unusable. The following day, it was performing well again.
The server never seems to be under duress regardless of whether it's performing well or poorly at the time.
The applications are all running in IIS, and the "network accessibility" issue usually manifests itself as excessive delays between page loads on the clients, but also sometimes results in a full timeout and error in the browser. These don't happen on the old physical server.
Well, the host doesn't have much else running on it. It has one Win7 VM, and an Ubuntu DNS server.
We haven't noticed any issues with either of them, but they're of course running different services...
We don't have the resources to build a full lab environment to do testing, I've just been firing up the VM in production, and listening for user feedback.
I don't normally use IIS- I'll try looking for some performance data from IIS, to see if it might provide some hints.
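One place IIS can help here: its W3C logs can include a time-taken column (in milliseconds) if that field is enabled under the site's Logging settings. With it on, slow requests can be pulled out with a one-liner. This is a sketch with fabricated sample lines, not real data from this server:

```shell
# Sample IIS W3C log (last column is time-taken in ms, when enabled):
cat > iis.sample <<'EOF'
#Fields: date time cs-method cs-uri-stem sc-status time-taken
2018-09-06 07:11:37 GET /app/page.aspx 200 14231
2018-09-06 07:11:39 GET /app/page.aspx 200 180
EOF

# Print requests slower than 5 seconds (5000 ms), skipping header lines:
awk '!/^#/ && $NF > 5000 { print $1, $2, $4, $NF "ms" }' iis.sample
```

Note that IIS's time-taken includes network transfer time to the client, so a slow network and a slow application can both inflate it; but the timestamps should at least show *when* the bad periods happen.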
I really don't know if the VM's performance/connectivity issues are on the network side, sending replies and receiving requests, or if they are on the application side.
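One way to separate the two is curl's timing variables: `%{time_connect}` covers just the TCP handshake, while `%{time_starttransfer}` adds server processing up to the first response byte. A near-normal connect time with a long first-byte time points at the application; a slow or failing connect points at the network path. A sketch, with a placeholder URL and a made-up sample measurement:

```shell
# From a client (URL is a placeholder):
#   curl -o /dev/null -s -w 'connect=%{time_connect} ttfb=%{time_starttransfer}\n' http://appserver/

# Interpreting a sample measurement:
sample="connect=0.002 ttfb=9.431"
set -- $sample
connect=${1#connect=}; ttfb=${2#ttfb=}
awk -v c="$connect" -v t="$ttfb" 'BEGIN {
  r = (t - c > 1) ? "app-side delay" : "network-side or normal"
  print r
}'   # app-side delay
```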
Since last I posted, I've managed to rebuild the entire guest system beginning with a fresh install of Server 2012, and ultimately got the same results.
In researching my problem, I've also seen a large number of past & present issues with Broadcom NICs. This HP server has 4 BCM5719 embedded NICs, and 2 Intel 10Gb NICs.
I've just realized that the vmkernel.log is full of thousands (10 lines or so every 30 seconds) of error messages that in some way relate to the Broadcom NICs.
"2018-09-06T07:11:37.587Z cpu13:2099148)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2099148/uw.2099148"
"2018-09-06T07:11:37.587Z cpu13:2099148)MemSchedAdmit: 477: uw.2099148 (9114) extraMin/extraFromParent: 117/117, nicmgmtd (806) childEmin/eMinLimit: 2478/2560"
I couldn't really find any information about these errors online at all.
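To see whether the message bursts actually line up with the user-reported slowdowns, the log can be bucketed by minute. A sketch, reusing the two lines quoted above as the sample file:

```shell
cat > vmk.sample <<'EOF'
2018-09-06T07:11:37.587Z cpu13:2099148)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2099148/uw.2099148
2018-09-06T07:11:37.587Z cpu13:2099148)MemSchedAdmit: 477: uw.2099148 (9114) extraMin/extraFromParent: 117/117, nicmgmtd (806) childEmin/eMinLimit: 2478/2560
EOF

grep -c 'MemSchedAdmit' vmk.sample                              # total count: 2
# Characters 1-16 of each timestamp give a per-minute bucket key:
grep 'MemSchedAdmit' vmk.sample | cut -c1-16 | sort | uniq -c
```

On the real host this would be run against /var/log/vmkernel.log; a count that spikes during the bad periods (and stays flat during good ones) would tie the NIC errors to the symptom.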
Moving all the traffic to the Intel NICs and disconnecting the Broadcom ports halted the errors. I'll have to wait a few days collecting user feedback to see if this has had any impact.
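For reference, the Broadcom ports can be told apart from the Intel ones by driver name before moving uplinks; on ESXi 6.7 the BCM5719 typically binds the native ntg3 driver, and `esxcli network nic list` shows the driver per vmnic. The sample output line below is fabricated:

```shell
# On the ESXi host:
#   esxcli network nic list

# Picking out vmnics bound to the Broadcom ntg3 driver from a sample line
# (third column of the real output is the driver name):
sample='vmnic0  0000:02:00.0  ntg3  Up  Up  1000  Full  xx:xx:xx:xx:xx:00  1500  Broadcom NetXtreme BCM5719'
echo "$sample" | awk '$3 == "ntg3" { print $1 }'   # vmnic0
```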
Examining the Broadcom NICs' firmware (1Gb 4-port 331i Adapter (22BE)), I found the Boot Code to be the latest version, but the NCSI version is 188.8.131.52 while the latest from HP shows 184.108.40.206. However, I can't get HP's installer to execute on ESXi 6.7 for some reason, so I guess I'll have to ask HP about that.
I doubt any of this is very useful to others, but documenting my experience all the same...
Disconnecting the Broadcom NICs halted the errors temporarily, but they ended up returning many hours later.
After disabling the embedded Broadcom quad-NIC adapter last Saturday, the "admission failure" messages all stopped that day and have not returned, for what that's worth.
I don't know if that helped anything beyond getting rid of log bloat yet...
I found a similar case with "admission failure" messages reported. Can you try disabling NetQueue RSS load balancing on the card?
esxcli network nic queue loadbalancer set --rsslb=off -n vmnicX