We have two vSphere clusters and are migrating VMs from the v5.5 cluster to the v6.5 one. On the exact same hardware (2-socket Xeon E5-2689 v4, 2 x (10 pCores / 20 threads) = 40 threads), we see a huge performance drop between 5.5 and 6.5 after migrating a large VM.
The hosts are in Ivy Bridge compatibility mode on the v5.5 cluster, and in Haswell compatibility mode on the v6.5 cluster.
The impacted VM is a 20-vcore SQL Server 2017 instance (so mission critical). It is configured with 4 vSockets and 5 cores per socket (for SQL Standard licensing reasons).
The symptoms we are seeing are a high response time in the guest SQL Server and a high CPU ready ratio on the v6.5 cluster (10 to 20%, compared to 0.5% on the old v5.5 cluster).
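For anyone comparing numbers: the CPU ready percentages above can be derived from the "CPU Ready" summation value (in milliseconds) shown in the vSphere performance charts. Below is a minimal sketch of that conversion, assuming the usual formula of ready time divided by the sampling interval, and a 20-second interval for real-time charts. The function name and the sample value are hypothetical, not taken from our hosts.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms) into a percentage per vCPU.

    interval_s: chart sampling interval in seconds (assumed 20 s for real-time charts).
    vcpus: number of vCPUs the summation covers (use 1 for a per-vCPU counter).
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    # Ready time as a share of the total time available across the covered vCPUs.
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0


# Hypothetical sample: 60,000 ms of ready time summed over 20 vCPUs in one 20 s sample.
sample = cpu_ready_percent(60_000, interval_s=20, vcpus=20)
print(f"CPU ready: {sample:.1f}% per vCPU")
```

Dividing by the vCPU count matters for a 20-vcore VM: a VM-level summation that is not normalized per vCPU looks far worse than it really is.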
One interesting thing: with the CPU-Z benchmark, we see a big difference between v5.5 and v6.5.
- on the v5.5 cluster, with a 20-vcore test VM (on an empty host, just this VM), we get a "Multi Thread Ratio" of 19.33 for 20 vcores, so basically the v5.5 CPU scheduler uses all the pCores of both sockets and not the HT threads (I assume an HT thread adds around 30 to 40% performance, not more).
- on the v6.5 cluster, with the same hardware and the same empty host apart from the test VM, we cannot get past a "Multi Thread Ratio" of 15.5. We may interpret this as: "the v6.5 CPU scheduler uses 10 cores at 100%, then 10 other cores at 55% of their raw power". Since this 55% is more than what HT threads can give, we suspect something in vSphere 6.5 is causing this big performance drop.
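To make the arithmetic behind that interpretation explicit, here is a small sketch. It uses a deliberately simple model (an assumption, not how the ESXi scheduler actually accounts for time): some vCPUs run at full speed and the rest at a fixed fraction. The function name is hypothetical; the 30 to 40% HT figure is my own estimate from above.

```python
def modelled_ratio(full_speed: int, partial: int, partial_fraction: float) -> float:
    """Multi-thread ratio if `full_speed` vCPUs run at 100% and
    `partial` vCPUs run at `partial_fraction` of a full core."""
    return full_speed * 1.0 + partial * partial_fraction


# Our reading of the v6.5 result: 10 vCPUs at 100% plus 10 vCPUs at 55%.
observed_model = modelled_ratio(10, 10, 0.55)

# What we would expect if the second 10 vCPUs only got HT-sibling throughput
# (assumed 30% to 40% of a core each).
ht_low = modelled_ratio(10, 10, 0.30)
ht_high = modelled_ratio(10, 10, 0.40)

print(f"model matching the v6.5 measurement: {observed_model:.2f}")
print(f"pure HT packing would give roughly:   {ht_low:.2f} to {ht_high:.2f}")
print(f"ideal 20 dedicated pCores:            {modelled_ratio(20, 0, 0.0):.2f}")
```

The measured 15.5 sits above the HT-only range but well below the 19.33 we get on v5.5, which is why neither "all pCores" nor "fully packed on HT siblings" explains it cleanly.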
These tests were run on hosts with no other VMs.
We have tried a lot of combinations during our tests:
- enabling or disabling "CPU Hot Add" in order to switch to legacy UMA instead of vNUMA
- enabling or disabling the numa.vcpu.preferHT setting
- changing the numa.vcpu.maxPerVirtualNode to 10, then 20
- tried 2 sockets x 10 vcores, 1 socket x 20 vcores, 20 sockets x 1 vcore
- tried changing the power management setting, or the latency
What is really strange is that up to 10 cores, the VM has a "Multi Thread Ratio" near what we expect (10 cores = 9.62 ratio), but beyond 10 cores things degrade (12 cores = 8.75, and raw performance is lower than with 10 cores...). While migrating, we noticed some performance gains on small VMs (up to 8 vCores), but for our 20-core one it's a fail...
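Put as scaling efficiency (ratio divided by vcore count), the figures quoted in this post look like this. This is only a sketch over the numbers already given above; the 10- and 12-core results are assumed to come from the v6.5 cluster, and the helper name is hypothetical.

```python
# (cluster, vcores, CPU-Z "Multi Thread Ratio") as quoted in this post.
# The 10- and 12-core rows are assumed to be v6.5 measurements.
measurements = [
    ("v5.5", 20, 19.33),
    ("v6.5", 10, 9.62),
    ("v6.5", 12, 8.75),
    ("v6.5", 20, 15.5),
]


def efficiency(ratio: float, vcores: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    if vcores <= 0:
        raise ValueError("vcores must be positive")
    return ratio / vcores


for cluster, vcores, ratio in measurements:
    print(f"{cluster}  {vcores:>2} vcores  ratio {ratio:>5.2f}  "
          f"efficiency {efficiency(ratio, vcores):.1%}")
```

The cliff is between 10 and 12 vcores, i.e. exactly where the VM stops fitting in the 10 pCores of a single socket.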
Apart from the CPU compatibility mode (Ivy Bridge / Haswell), the only difference we see between 5.5 and 6.5 in the VM logs is the Spectre CPUID flags on v6.5 (cpuid.IBRS, cpuid.IBPB, cpuid.STIBP). I can't find any of these lines in the 5.5 logs. But since public benchmarks don't report a 30% performance drop from the Spectre/Meltdown patches, we think this may be unrelated.
Last detail: our v6.5 hosts are patched for L1TF (Sequential-context attack vector), since we are seeing the "esx.problem.hyperthreading.unmitigated" warning on them. However, we didn't activate the VMkernel.Boot.hyperthreadingMitigation setting (the Concurrent-context attack vector mitigation).
We are out of ideas, and would really like to get the same performance on v6.5 as we have on v5.5 (since it's the same hardware).
If someone could shed some light on this topic, we would be infinitely grateful!
Thanks in advance!