I am facing a strange issue that I cannot reproduce but that appears every couple of months on different ESXi hosts.
It is very annoying because all the VMs on the affected host become so slow that they are unusable.
After a sustained heavy workload on the disk subsystem (>100 MB/s for several days), the disk latency rises from 0-20 ms to more than 20,000 ms. In some cases I have seen more than 100,000 ms. The VMs keep running but, as you can imagine, they barely respond. For example, a normal Windows login as local administrator takes more than an hour.
How to solve?
The only workaround I have found is to stop all the VMs and restart the host. After the restart, everything works fine again.
3x ESXi 4.1 (build 260247) on HP BL490c G7 (all identical)
1x HP MSA2324i G3 with 24x 500 GB SAS, connected via 4x 1 Gbit iSCSI
I have already done a lot of testing and tried many different settings (too many to list them all), but here are some key findings:
- The issue is not reproducible, at least I have not managed to trigger it, but it happens every 2-3 months.
- The problem has already occurred on all 3 ESXi hosts, but never on more than one at the same time.
- I assume that the problem is related to the hosts, not to the storage: while one host is affected, the other two hosts access the same LUN on the same MSA and work normally. After a restart of the affected host, everything is fine.
- The same issue already occurred with ESXi 4.0.
Please see the screenshots below to get an idea:
This is the disk activity of the last 12 months. As you can see, the workload has increased quite a lot over the last few weeks.
This behavior is expected, because we added some huge databases to this host recently.
Below you can see what happened when the issue occurred.
The disk activity dropped immediately to nearly 0 and stayed there until the host was restarted.
During this period, the latency was above 20,000 ms, but it returned to normal after the reboot.
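To make the symptom easier to spot in exported performance data, here is a minimal Python sketch that flags periods in which the latency stays above a threshold for several consecutive samples. It is only an illustration under assumptions: the input format (a list of `(timestamp, latency_ms)` pairs), the function name `find_stalls`, and the default values are all hypothetical, and the samples would first have to be exported from the host by some other means.

```python
from typing import List, Tuple

# Hypothetical input format: (timestamp, latency in milliseconds) pairs,
# ordered by time. This is not an ESXi API; the data must be exported first.
Sample = Tuple[float, float]


def find_stalls(
    samples: List[Sample],
    threshold_ms: float = 20000.0,  # assumed threshold, taken from the >20,000 ms symptom
    min_consecutive: int = 3,       # assumed: ignore short spikes of fewer samples
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of runs where latency stayed above the threshold."""
    stalls: List[Tuple[float, float]] = []
    run: List[Sample] = []
    for sample in samples:
        if sample[1] > threshold_ms:
            run.append(sample)
            continue
        # Latency is back below the threshold: close the current run, if any.
        if len(run) >= min_consecutive:
            stalls.append((run[0][0], run[-1][0]))
        run = []
    # Handle a run that is still open at the end of the data.
    if len(run) >= min_consecutive:
        stalls.append((run[0][0], run[-1][0]))
    return stalls


if __name__ == "__main__":
    data = [(0, 10), (1, 25000), (2, 30000), (3, 100000), (4, 15), (5, 21000)]
    print(find_stalls(data))
```

With the sample data above, only the three consecutive high-latency samples are reported as a stall; the single spike at the end is ignored because it is shorter than `min_consecutive`.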
I hope that someone has already faced this issue and can give me some advice.