The company I work for runs more or less identical datacenters at a number of remote sites: two ESXi 6.7 or 7.0 hosts, with about 5-7 Windows VMs (plus one small Linux appliance) split across them, and roughly 20 vCPUs allocated per host across the active VMs.
Each ESXi host has two Intel Xeon Silver 4110 or 4210 CPUs, 192 GB RAM, and 4x mechanical disks in RAID 5.
The VMs running on them sometimes struggle - slow UI response, high CPU usage, and high disk queue numbers. It could be something inside the VMs, or it could be that the ESXi hosts aren't sized correctly for the workload.
From the look of it, the choke point is the disks, especially write IOPS. What would be a good way to confirm this - which metrics, and exposed where? Normally I'd look at queue length over time and at latencies, yet I don't see these exposed at the ESXi level in vSphere...
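To make the question concrete: I do have shell access to the hosts, so I can capture esxtop in batch mode (`esxtop -b -a -d 5 -n 120 > capture.csv`) and get the per-device latency counters (DAVG/cmd, KAVG/cmd, GAVG/cmd). Here is roughly how I'd interpret the numbers once collected - a sketch, where the thresholds are common rules of thumb I've seen cited, not official VMware limits:

```python
# Rule-of-thumb latency thresholds (ms) for esxtop disk counters.
# These are assumptions for illustration, not official VMware limits.
DAVG_LIMIT = 25.0   # device latency: the array/spindles are saturated
KAVG_LIMIT = 2.0    # kernel latency: commands are queuing inside ESXi

def diagnose(samples):
    """samples: list of (device, davg_ms, kavg_ms) tuples from a capture.
    Returns (device, suspected cause) for each device over a threshold."""
    findings = []
    for device, davg, kavg in samples:
        if davg > DAVG_LIMIT:
            findings.append((device, "device (array) saturated"))
        elif kavg > KAVG_LIMIT:
            findings.append((device, "queuing in the ESXi kernel"))
    return findings

# Hypothetical numbers resembling a loaded RAID 5 of mechanical disks:
capture = [
    ("naa.6001", 38.2, 0.4),   # high device latency
    ("naa.6002", 12.1, 3.1),   # device OK, kernel queue building up
    ("naa.6003", 4.0, 0.2),    # healthy
]
print(diagnose(capture))
```

Is that the right way to read these counters, or is there a better signal for write-IOPS saturation specifically?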
That said, could the CPUs also be a choke point?
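From what I understand, the usual tell for CPU contention is ready time rather than raw utilization. vSphere exposes `cpu.ready.summation` in milliseconds per sampling interval (20 s for real-time charts), and converting it to a percentage is straightforward - a sketch, with the interval and per-vCPU division being my reading of how the counter works:

```python
def ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert vSphere's cpu.ready.summation (ms accumulated per sampling
    interval) into a percentage. Real-time charts sample every 20 s; the
    summation covers all vCPUs, so divide by vCPU count for a per-vCPU figure."""
    return ready_ms / (interval_s * 1000 * vcpus) * 100

# Hypothetical sample: a 4-vCPU VM with 3200 ms of ready time in one 20 s interval.
print(ready_percent(3200, vcpus=4))  # 4.0 (% ready per vCPU)
```

The rule of thumb I've seen is that sustained ready time above roughly 5% per vCPU means the VMs are fighting for pCPU time - does that match what others use in practice?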
Adding more ESXi hosts is not an option at the moment, but we could plan to use different CPUs (more cores? more GHz?) going forward, and also consider switching mechanical disks to SSDs.
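For the SSD question, the write-IOPS arithmetic is part of why I suspect the disks - a back-of-the-envelope sketch, assuming ~150 IOPS per 7.2k mechanical disk (a common planning figure, not a measured one) and the standard RAID 5 write penalty of 4:

```python
def effective_iops(disks, iops_per_disk, read_fraction, write_penalty=4):
    """Back-of-the-envelope effective IOPS for a RAID group. Each logical
    write costs `write_penalty` backend I/Os (4 for RAID 5: read data,
    read parity, write data, write parity)."""
    raw = disks * iops_per_disk
    return raw / (read_fraction + (1 - read_fraction) * write_penalty)

# 4x mechanical disks (~150 IOPS each) in RAID 5, 70/30 read/write mix:
print(round(effective_iops(4, 150, 0.7)))    # ~316 IOPS for the whole array

# Same layout with entry-level SSDs (~20,000 IOPS each, conservative):
print(round(effective_iops(4, 20000, 0.7)))  # ~42,105 IOPS
```

If that arithmetic holds, roughly 300 effective IOPS shared by 5-7 Windows VMs seems very tight, and SSDs would move the bottleneck elsewhere - which is why I'd like to confirm the disk metrics first before committing to the swap.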