For a while now I've been mildly annoyed by an intermittent issue in my ESXi home lab, where certain VMs become unresponsive, or the interface through VMware Infrastructure Client becomes unresponsive ... or both. The only resolution is to restart the entire host. Sometimes the box can go over a week without needing a reboot; sometimes it only lasts a few hours. After a particularly bad day of four or five reboots, I started to look into this a little deeper, and I've been able to reproduce what looks like the same issue, although still only intermittently.
Allow me to briefly explain my tests, and the outcomes:
1. Moving a file from one datastore to another with no VMs powered on, using the 'mv' command in an SSH session. After a while I get an "I/O error", and the whole box hangs.
2. Moving THE SAME FILE as above, in the same circumstances one day later. Succeeds. Odd.
3. Moving another file between the same two datastores. 10 minutes in, the SSH window still hasn't returned to the command prompt. My other SSH window becomes very sluggish, the VI Client becomes unresponsive, and I get a "The request failed because the remote server took too long to respond." error when I try to do just about anything. There aren't any errors in the SSH session this time, however.
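In hindsight, one way to narrow these 'mv' hangs down would be to do the copy in fixed-size chunks, so that a hang can at least be tied to a rough byte offset in the file. This is a sketch I haven't run on the box itself — the paths are placeholders (on ESXi they'd be under /vmfs/volumes/), and it assumes the busybox dd supports skip/seek/conv=notrunc:

```shell
# Sketch: copy a large file in fixed-size chunks so a hang can be tied
# to a specific offset. SRC/DST are placeholder paths, not my real
# datastores; CHUNK is tiny here so the sketch runs quickly.
SRC=${SRC:-/tmp/chunk-src.bin}
DST=${DST:-/tmp/chunk-dst.bin}
CHUNK=4   # MiB per chunk; something like 1024 makes more sense for a vmdk

# Fabricate a 10 MiB source file for the purposes of the sketch.
[ -f "$SRC" ] || dd if=/dev/zero of="$SRC" bs=1048576 count=10 2>/dev/null

size=$(wc -c < "$SRC")
chunks=$(( (size + CHUNK * 1048576 - 1) / (CHUNK * 1048576) ))
i=0
while [ "$i" -lt "$chunks" ]; do
    echo "copying chunk $i of $chunks"
    dd if="$SRC" of="$DST" bs=1048576 count="$CHUNK" \
       skip=$((i * CHUNK)) seek=$((i * CHUNK)) conv=notrunc 2>/dev/null
    i=$((i + 1))
done
```

If the box hangs mid-copy, the last "copying chunk" line printed tells you roughly where in the file it died, which would at least distinguish "always at the same offset" from "random".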
So I seem to have three different outcomes for what I consider to be the same action. When I try #3 again, this time with VMs powered up, lo and behold they become dead to the world after about 10 minutes. The System Logs contain a number of "Failed to validate VM IP address: unknown" errors, along with lines similar to "vm:/vmfs/volumes/4900d2ed-db0b2ac6-86c5-00089bb2144e/duck/duck.vmx' 81926 info] Turning off heartbeat checker" for what looks like each of the dead VMs.
It appears that high iSCSI I/O is the cause of these hangs, and that it can take out datastores other than the source and the destination. I'm still unsure as to the root cause being with my storage, ESXi code, or the hardware of the ESXi box, so please chip in if you have any thoughts on the matter...
The ESXi machine is a low-powered Mini-ITX motherboard (IEI KINO-690S1) with a 64-bit dual-core Turion and 4GB RAM. It has two onboard gigabit NICs. When I originally set the machine up, I used one NIC for iSCSI and one for the VM Network. Since then, I have put both iSCSI and VM traffic on the same NIC, as the frequency of reboots seemed to decrease when I did so. In another troubleshooting move, I stopped using either onboard NIC and switched to an Intel PCI Pro/1000 GT Desktop gigabit NIC instead. This is now the only NIC in action, and it obviously hasn't made any difference. I'm running ESXi 3.5.0 build 130755.
ESXi doesn't like either the IDE or the SATA controller on the motherboard, so I boot from USB, and datastores are presented over iSCSI from a machine running OpenSolaris Nevada (SXCE) build 99. The iSCSI targets on this machine are backed by thin-provisioned, transparently compressed ZFS volumes, assisted by a Samsung 32GB SSD acting as a ZFS L2ARC cache, which sits between RAM and disk. The pool of disks is called "zp", as I wasn't feeling very imaginative when I named it!
My troubleshooting leading up to the discovery of the VM hangs went as follows. I started when there wasn't any load on the Solaris zpool, as verified with "zpool iostat zp 1", and by this point I had discovered that moving the file with 'mv' produced the "I/O error" message in the SSH window.
I restarted by means of a hard reset, used SSH to connect to the ESXi box, and ran the following to test the readability of the source datastore:
dd if=/vmfs/volumes/49035d9a-fdfaa11d-bb13-00089bb3144e/dingo/dingo-flat.vmdk of=/dev/null bs=1048576
The network graph in ESXi showed about 20MB/sec, and disk graph was around 18MB/sec. Quite low figures, I'd say, but at least it completed fine.
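For what it's worth, the throughput number can also be had straight from the shell, without trusting the VI Client graphs, by timing dd by hand. A sketch (the input file here is a stand-in I create on the spot; on the ESXi box it would be the -flat.vmdk path above):

```shell
# Sketch: measure sequential read throughput without the performance
# graphs. F is a placeholder; on ESXi it would be the -flat.vmdk path.
F=${F:-/tmp/read-test.bin}
MB=32
dd if=/dev/zero of="$F" bs=1048576 count="$MB" 2>/dev/null   # stand-in input

start=$(date +%s)
dd if="$F" of=/dev/null bs=1048576 2>/dev/null
end=$(date +%s)

elapsed=$((end - start))
[ "$elapsed" -gt 0 ] || elapsed=1   # avoid divide-by-zero on tiny files
echo "read $MB MiB in ${elapsed}s: $((MB / elapsed)) MB/s"
```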
Next, to simulate a write to the destination datastore:
dd if=/dev/zero of=/vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/dingo/testfile bs=1048576 count=10240
This also completed fine. Got over 30 MB/sec out of it, according to the ESXi performance graph.
dd if=/dev/urandom of=/vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/dingo/testfile bs=1048576 count=10240
Completed fine. This time around, writes were at 5 MB/sec and the CPU maxed out at 50%. I'm guessing the slow speed is due to the calculation of the random data. Shame I can't use any local storage as the input file. Note that in both of these tests, not much actual writing happens on the Solaris side: in the first case, thin provisioning and compression turn the 30MB/s stream of zeroes into next to nothing on disk, and in the second, the bandwidth is low (5 MB/s) to begin with. Those familiar with ZFS's write behaviour will understand when I say that in the "urandom test" above, the writes hit the disks in bursts of 50 or 60MB, as each transaction group is flushed, rather than as a continuous 5MB/sec.
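One workaround for the urandom CPU ceiling would be to pay the random-generation cost only once: build a single random buffer, then replay it repeatedly. As long as the buffer is at least as big as the ZFS recordsize (128K by default), every record the target sees is still random data, so compression can't cheat the test. A sketch, with placeholder paths and a deliberately small output size — I haven't run this on the box itself:

```shell
# Sketch: generate 1 MiB of random data once, then append it repeatedly.
# Each 128K ZFS record is still incompressible random data, but the CPU
# cost of /dev/urandom is paid only once. Paths are placeholders.
BUF=${BUF:-/tmp/randbuf.bin}
OUT=${OUT:-/tmp/rand-testfile.bin}
REPEAT=16   # 16 MiB for the sketch; scale up against a real datastore

dd if=/dev/urandom of="$BUF" bs=1048576 count=1 2>/dev/null
rm -f "$OUT"
i=0
while [ "$i" -lt "$REPEAT" ]; do
    cat "$BUF" >> "$OUT"
    i=$((i + 1))
done
```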
To test a scenario where data is actually written to disk, I set up another volume, this time without compression or thin provisioning. The following command pushed 30 MB/s of writes into that store:
dd if=/dev/zero of=/vmfs/volumes/497c8314-67505086-4e4f-00089bb3144e/nocompression bs=1048576 count=10240
This completed without a hitch, and my "zpool iostat zp 1" on the Solaris side showed bursty writes again. The graph in ESXi showed a steady 30MB/sec while writing.
So, I came to the conclusion that reads and writes are fine from an ESXi perspective, and the Solaris box can cope with whatever writes are given to it over the network. At this point I retried the actual move that had been failing with an "I/O error", and it actually worked (see #2 in my list at the very top of this post). Very annoying.
mv /vmfs/volumes/49035d9a-fdfaa11d-bb13-00089bb3144e/dingo/dingo-flat.vmdk /vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/dingo
I then tried again with the remaining files I needed to move from that datastore (test #3):
mv /vmfs/volumes/49035d9a-fdfaa11d-bb13-00089bb3144e/test-dc2/* /vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/test-dc2/
The graph started to show some activity in ESXi. The receive rate on the NIC was approx 9MB/s; the transmit rate was approx 18MB/s. I do wonder why it would be trying to transmit more than it had received. Any ideas? About 10 minutes into moving these files, my SSH window simply became unresponsive and the graph dropped to zero.
So, that's where I'm at. I may be jumping to conclusions here, but I think the hanging is a bug, and has something to do with the interaction between the Solaris iSCSI target and the ESXi iSCSI initiator. The hardware isn't an amazing spec, but if it couldn't keep up, I'd expect my VMs to become sluggish rather than the whole system losing datastores. iSCSI and LAN traffic share the same gigabit switch, although in tests #1 - #3 there was no activity on the network other than iSCSI. From a Windows box, I can do a similar type of operation, moving data between two iSCSI-presented volumes, and it completes correctly.
However: I tried this just now and the ESXi machine hung again. When I restarted the ESXi box, the VMs were slow to load, as I'd expect when there is other disk I/O on the Solaris machine, but some actually hung in an odd fashion. My squid web cache, for example, was ping-able, but I couldn't SSH to it, or use it to download web pages for that matter. There were also unhealthy-looking gaps in the ESXi performance graph. Why would the system stop recording performance data?! It's now looking like read/write activity on the Solaris box, or over the network in general, is the culprit, although I'm still unsure why it would have the effect it does.
The only remedy I have considered trying is to install OpenSolaris 2008.11 (snv_101b) or SXCE snv_105 on the Solaris machine and see if that makes any difference. I would prefer to understand exactly what the problem is before I do that, however.
Thanks for any advice - I'm starting to become lost in my own troubleshooting!