We have been a long-time Intel shop and are going through another lifecycle replacement of our servers. We have been working with Dell to test a pair of R6525s running the AMD Milan EPYC 7543 32-core procs. We loaded them with ESXi 7.02 (and recently applied the latest 7.02a update). We have a large 48-node Intel-based cluster with several hundred RDP VMs that participate in a load-balanced pool, so the workload is identical. We moved some of these VMs onto a dedicated 2-node EPYC cluster and initially things looked good, until we started having machine check error crashes that just rebooted the host, with no PSOD. VMware recommended a driver update based on something they saw in the logs. Since doing that, we have not had any more of them.
What we did notice is that both EPYC hosts randomly hit 99% CPU usage for varying amounts of time. With no VMs running on one of these hosts, the CPU is quiet. Move just 1 VM onto it, and the spiking occurs. There is not enough CPU allocated to 1 VM to cause the host to spike like this, so I can rule out anything having to do with activity inside the guest OS.
We do have active tickets with VMware and Dell on this, but I am getting frustrated, so I thought I would post the issue here to see if anyone can think of anything that might be causing it.
BTW, VM specs are: 1 socket, 4 cores, 16GB RAM, 100GB hard drive. VM HW version is vmx-19. The server host is fully updated on firmware, and the BIOS settings are set to Performance (by Dell). We tried tweaking NUMA configs, but that actually made things worse. Generally speaking, the hosts run great and our apps run well too, but we can't deploy a product with this issue. So again, any help would be greatly appreciated.