I have two Dell R740s, both running ESXi 6.5, and each server runs only one VM (Windows Server), to which all resources are allocated.
One of them runs well, but the other performs terribly: it is very, very slow. I really cannot imagine a server with 32 vCPUs being as slow as that.
What is strangest is that if I log in to the VM and open Windows Task Manager, the CPU usage is very low (10%), but if I log in to the ESXi web console, the host CPU is high (80%).
I tried esxtop in the CLI; it shows a 0.9 average, with most resources used by the only VM.
I searched but found nothing. Any ideas or help would be greatly appreciated.
So there are a few things going on here. One is the difference in perceived usage (%) between the ESXi host and the VM. That is normal. Usage is a qualitative counter, normalized to a single core at nominal frequency; since you have more vCPUs assigned to the VM than cores available on the host, it will be off. There are a bunch of other differences that I'll have to write about another time, but for now, just know the difference is normal. That includes what you see in Task Manager: it assumes 32 vCPUs for _utilization_ (not idle), not usage, and it isn't aware of frequency, HT, etc.
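To make the denominator point concrete, here is a rough Python sketch. It is only an illustration: the 16 physical cores and the 4 busy core-equivalents are made-up numbers (the thread does not state the host's core count), and the helper `percent_of_capacity` is hypothetical. It ignores hyper-threading, frequency scaling and hypervisor overhead, all of which shift the real counters further.

```python
def percent_of_capacity(busy_units: float, capacity_units: int) -> float:
    """Express an amount of busy CPU time as a percentage of a given capacity."""
    if capacity_units <= 0:
        raise ValueError("capacity_units must be positive")
    return 100.0 * busy_units / capacity_units


# Assumed values, for illustration only (not taken from the thread).
VCPUS = 32           # what the guest OS believes it has
PHYSICAL_CORES = 16  # hypothetical host core count
busy = 4             # the same work, in "one fully busy core" units

guest_view = percent_of_capacity(busy, VCPUS)
host_view = percent_of_capacity(busy, PHYSICAL_CORES)
print(f"guest sees {guest_view:.1f}%, host sees {host_view:.1f}%")
```

The same amount of work produces two different percentages simply because each side divides by a different capacity, so the two figures are not directly comparable.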
Now, the host does seem to be somewhat frequency scaled. That might not be the reason for the performance issue, but it should at least be eliminated.
Can you post another screenshot of esxtop in the power management screen (p) with amperf enabled (f / f)? You might also want to watch the first part here: https://www.vmworld.com/en/video-library/video-landing.html?sessionid=15614128031020019IBb&region=EU
Yeah, the host has a "Dynamic" power policy, but the frequency scaling isn't all that bad. It's not very likely that this is the cause of the issue if you see the same (no P-States but %C2 and %A/MPERF < 100 on some PCPUs) on the other host. Still, I'd recommend changing it: set it to max performance in the BIOS, then to custom, and hand C1E, deep C-States and P-State control to the "OS" (so ESXi). If it allows for HWP, make sure you enable legacy P-States.
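For reading the %A/MPERF column: a value below 100 means the PCPU ran below its nominal clock on average, and a value above 100 means it ran above it (turbo). A small Python sketch of that arithmetic follows; the 2.1 GHz nominal frequency and the sample readings are assumptions chosen only for illustration, and `effective_ghz` is a hypothetical helper.

```python
def effective_ghz(nominal_ghz: float, aperf_mperf_pct: float) -> float:
    """Approximate average clock from the APERF/MPERF ratio (given in percent)."""
    if nominal_ghz <= 0 or aperf_mperf_pct < 0:
        raise ValueError("nominal_ghz must be positive and the ratio non-negative")
    return nominal_ghz * aperf_mperf_pct / 100.0


NOMINAL_GHZ = 2.1  # assumed nominal frequency, not taken from the thread

for pct in (80.0, 100.0, 120.0):  # made-up %A/MPERF readings
    print(f"%A/MPERF {pct:5.1f} -> ~{effective_ghz(NOMINAL_GHZ, pct):.2f} GHz")
```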
Well, the VM on the second host isn't as busy. Whatever caused the utilization difference is running in the guest OS (or _theoretically_ also something on the CPU / using different instructions / monitor overhead, very unlikely though). What are the processes that are running on the slow / busy VM? Why are they consuming as much CPU as they do? That would have to be answered using in-guest tools.
Honestly, what confuses me is that there should be no difference between these two servers. They are two Windows servers on the same platform/hardware, both virtualized, and they are in one Windows High Availability group.
The processes on the two servers should also be the same. As I said, if I log on to the slow server and open Task Manager, everything looks fine: the CPU is low, the memory is low, the disk is low... But it's very slow; it takes more than a minute to open Windows Task Manager.
So does the issue follow the VM? Do you have vCenter? Can you migrate the VM to the other host and try?
If not, are you 100% sure that when you look at perfmon, they have the same resource usage per process? Can you post a Task Manager screenshot (performance tab / all cores) from both VMs?
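One way to compare the two guests systematically, rather than by eye, is to export per-process CPU figures from each VM and diff them. A minimal Python sketch, assuming you already have the figures as name-to-percent mappings; the function `cpu_differences` is hypothetical, and the process names and numbers below are placeholders, not data from these servers:

```python
def cpu_differences(slow: dict[str, float], fast: dict[str, float]) -> list[tuple[str, float]]:
    """Return (process, slow - fast) pairs, largest absolute difference first."""
    names = set(slow) | set(fast)
    # A process missing on one side is treated as using 0% there.
    deltas = [(name, slow.get(name, 0.0) - fast.get(name, 0.0)) for name in names]
    return sorted(deltas, key=lambda item: (-abs(item[1]), item[0]))


# Placeholder snapshots (percent CPU per process), for illustration only.
slow_vm = {"app.exe": 6.0, "agent.exe": 3.5, "System": 1.0}
fast_vm = {"app.exe": 5.5, "agent.exe": 0.5, "System": 1.0}

for name, delta in cpu_differences(slow_vm, fast_vm):
    print(f"{name:12s} {delta:+.1f}")
```

Processes at the top of the list are the first candidates to look at with in-guest tools.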
Yes, we have vCenter, but unfortunately we don't have enough resources to move this server to the other host.
For the performance of the two servers, please see the pictures.
The problem is that even if the CPU is a little high on the slower server, I still cannot imagine this server (with 32 vCPUs and 128 GB of memory) needing more than a minute to open Windows Task Manager. That's why I'm wondering if it's a VM issue.
You don't have to exhaust all available resources to have a performance issue, i.e. you don't have to be CPU capped. All kinds of things could cause this. Maybe the source of the problem also results in the increased CPU usage compared to that other VM, or it is the other way around. You should investigate where you lose time when you open Task Manager, e.g. first with ProcMon w/ trace events, then maybe get an xperf trace with stackwalk. This is OS-level work, and you might need to involve the vendor if you need help with that.
To fully eliminate an issue on the host, you might find the downtime to have the VMs switch hosts at some point. (if the issue _doesn't_ move with the VM, then there could also be e.g. fabric / port / HW issues etc.)