cjmurray
Contributor

VMs in iSCSI datastores hanging after high disk I/O

Hi all,

For a while now I've been mildly annoyed by an intermittent issue in my ESXi home lab, where certain VMs become unresponsive, or the VMware Infrastructure Client interface becomes unresponsive ... or both. The only resolution is to restart the entire host. Sometimes the box can go over a week without needing a reboot, and sometimes it will only last a few hours. After a particularly bad day of needing to reboot it 4 or 5 times, I started to look into this a little deeper, and I've been able to reproduce what looks like the same issue, although still intermittently.

Allow me to briefly explain my tests, and the outcomes:

1. Moving a file from one datastore to another when no VMs are powered on, using the 'mv' command in an SSH session. After a while I get an "I/O error", and the whole box hangs.

2. Moving THE SAME FILE as above, in the same circumstances one day later. Succeeds. Odd.

3. Moving another file between the same two datastores. Ten minutes in, the SSH window hasn't returned to the command prompt. This makes my other SSH window very sluggish, the VI Client becomes unresponsive, and I get a "The request failed because the remote server took too long to respond." error when I try to do just about anything. There aren't any errors in the SSH session this time, however.

So I seem to have three different outcomes for what I consider to be the same action. When I try #3 again, this time with VMs powered up, lo and behold they become dead to the world after about 10 minutes. The System Logs contain a number of "Failed to validate VM IP address: unknown" errors, along with lines similar to "vm:/vmfs/volumes/4900d2ed-db0b2ac6-86c5-00089bb2144e/duck/duck.vmx' 81926 info] Turning off heartbeat checker" for what looks like each of the dead VMs.

It appears that high iSCSI I/O is the cause of these hangs, and that it can take out datastores other than the source and the destination. I'm still unsure whether the root cause lies with my storage, the ESXi code, or the hardware of the ESXi box, so please chip in if you have any thoughts on the matter...

The ESXi machine is a low-powered mini-ITX motherboard (IEI KINO-690S1) with a 64-bit dual-core Turion and 4GB RAM. It has two onboard gigabit NICs. When I originally set the machine up, I used one NIC for iSCSI and one for the VM Network. Since then I have put both iSCSI and VM traffic on the same NIC, as the frequency of reboots seemed to decrease when I did so. In another troubleshooting move, I stopped using either onboard NIC and used an Intel PCI Pro/1000 GT Desktop gigabit NIC instead. This is now the only NIC in action, and it hasn't made any difference. I'm running 3.5.0 build 130755.

ESXi doesn't like either the IDE or SATA controller on the motherboard, so I boot from USB, and datastores are presented over iSCSI from a machine running OpenSolaris Nevada (SXCE) build 99. iSCSI targets on this machine are backed by ZFS volumes with thin-provisioning, transparent compression, and are assisted by a Samsung 32GB SSD "ZFS L2ARC" cache, which sits between RAM and disk. The pool of disks is called "zp", as I wasn't feeling very imaginative when I named it!
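
For the curious, the targets are built roughly like this on the Solaris side (the pool/dataset names, sizes, and cache device below are illustrative, and shareiscsi is the legacy mechanism these SXCE builds use):

zpool add zp cache c2t0d0                            # 32GB SSD added as L2ARC
zfs create -s -V 100G -o compression=on zp/esx-ds1   # sparse (thin-provisioned), compressed zvol
zfs set shareiscsi=on zp/esx-ds1                     # export the zvol as an iSCSI target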

My troubleshooting leading up to the discovery of the VM hangs was as follows. I started when there wasn't any load on the Solaris zpool, as verified by "zpool iostat zp 1", and had discovered by this point that moving the file with 'mv' produced the "I/O error" message in the SSH window.

I restarted by means of a hard reset, used SSH to connect to the ESXi box, and ran the following to test reads from the source datastore:

dd if=/vmfs/volumes/49035d9a-fdfaa11d-bb13-00089bb3144e/dingo/dingo-flat.vmdk of=/dev/null bs=1048576

The network graph in ESXi showed about 20MB/sec, and disk graph was around 18MB/sec. Quite low figures, I'd say, but at least it completed fine.

Next, to simulate a write to the destination datastore:

dd if=/dev/zero of=/vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/dingo/testfile bs=1048576 count=10240

This also completed fine. Got over 30 MB/sec out of it, according to the ESXi performance graph.

dd if=/dev/urandom of=/vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/dingo/testfile bs=1048576 count=10240

Completed fine. This time around, the writes were 5 MB/sec, and the CPU maxed out at 50%. I'm guessing the slow speed is due to the calculation of random data. Shame I can't use any local storage as the input file. Note that in both of these tests, there isn't much actual writing done on the Solaris side. In the first case, thin provisioning and compression turn the 30 MB/s stream of zeroes into next to nothing to write to disk, and in the second test, the bandwidth is low (5 MB/s). Those familiar with ZFS's write activity will understand when I say that in the "urandom test" above, the writes occurred in bursts of 50 or 60MB rather than a continuous 5 MB/sec.
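
One idea I haven't tried: stage a random file on the source datastore once, then use it as the dd input, so the rate isn't limited by /dev/urandom and compression has nothing to squeeze (the datastore paths below are just placeholders):

dd if=/dev/urandom of=/vmfs/volumes/<source-datastore>/random.bin bs=1048576 count=1024
dd if=/vmfs/volumes/<source-datastore>/random.bin of=/vmfs/volumes/<dest-datastore>/testfile bs=1048576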

To test a scenario where data is actually written to disk, I set up another volume, without compression and without thin provisioning. The following command simulated 30 MB/s of writes into that store:

dd if=/dev/zero of=/vmfs/volumes/497c8314-67505086-4e4f-00089bb3144e/nocompression bs=1048576 count=10240

This completed without a hitch, and my "zpool iostat zp 1" on the Solaris side showed bursty writes again. The graph in ESXi showed a steady 30MB/sec while writing.
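
For anyone wanting to recreate that volume, something like this does it on the Solaris side (name and size are illustrative):

zfs create -V 20G -o compression=off zp/nocompression   # non-sparse, uncompressed zvol
zfs set shareiscsi=on zp/nocompression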

So, I came to the conclusion that reads and writes are fine from an ESXi perspective, and the Solaris box can cope with whatever writes are given to it over the network. At this point I did the actual move which had been failing with an "I/O error", and it actually worked (see #2 in my explanation at the very top of this post). Very annoying.

mv /vmfs/volumes/49035d9a-fdfaa11d-bb13-00089bb3144e/dingo/dingo-flat.vmdk /vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/dingo

I then tried again with the remaining files I needed to move from that datastore (test #3):

mv /vmfs/volumes/49035d9a-fdfaa11d-bb13-00089bb3144e/test-dc2/* /vmfs/volumes/49743cf6-be846eea-53da-00089bb3144f/test-dc2/

The graph started to show some activity in ESXi. The receive rate on the NIC was approx 9 MB/s; the transmit rate was approx 18 MB/s. I do wonder why it would be trying to transmit more than it had received. Any ideas? While moving these files, my SSH window just became unresponsive after about 10 minutes and the graph dropped to zero.

So, that's where I'm at. I may be jumping to conclusions here, but I think the hanging is a bug and has something to do with the interaction between the Solaris iSCSI target and the ESXi iSCSI initiator. The hardware isn't an amazing spec, but if it couldn't keep up, I'd expect my VMs to become sluggish rather than the whole system losing datastores. iSCSI and LAN traffic are shared on the same gigabit switch, although in tests #1 - #3 there was no activity on the network other than iSCSI. From a Windows box, I can do a similar type of operation to move data between two iSCSI-presented volumes, and this completes correctly.

However, I tried this just now and the ESXi machine hung again. I restarted the ESXi box and the VMs were slow to load, as I expected when there is other disk I/O on the Solaris machine, but some actually hung in an odd fashion. My Squid web cache, for example, was pingable, but I couldn't SSH to it or use it to download web pages, for that matter. Also, there were unhealthy-looking gaps in the ESXi performance graph. Why would the system stop recording performance data?! It's now looking like read/write activity on the Solaris box, or over the network in general, is the culprit, although I'm still unsure as to why it would have the effect it does.

The only thing I have considered trying to remedy the situation is to install OpenSolaris 2008.11 (snv_101b) or SXCE snv_105 on the Solaris machine and see if that makes any difference. I would prefer to understand exactly what the problem is before I do that, however.

Thanks for any advice - I'm starting to become lost in my own troubleshooting!

Chris

5 Replies
PaulFF
Contributor

I'm having a similar problem. I made a thread here: http://communities.vmware.com//thread/197457?tstart=0

This happens with both Linux and Windows for us.

cjmurray
Contributor

Hi Paul,

Unfortunately not, but things are a little better than they were. The plot thickened somewhat as I investigated further (as if the post above isn't confusing enough!), and it's now looking like heavy writes to the Solaris box in general are the culprit.

After messing around with switches (gigabit and 100 Mbit), I found that with 100 Mbit the whole setup appeared to be more stable. For a while I also suspected that the sharing of iSCSI and LAN traffic was the problem, until I discovered that performing writes as fast as the disks will allow (dd directly into the zpool on the Solaris machine) led to the same hanging on ESXi. Pretty quickly, too. If I ran "zpool iostat 1" while the writes were happening, the usual zero write I/O was displayed while writes were being cached, until 5-second bursts of writes occurred at full speed. See this link for an explanation of how the write throttle now works post OpenSolaris snv_87:

If I play a video from the same pool (home lab, remember) while I run this test, the video stutters while the writes are being flushed. The iostat output shows that reads drop to almost zero while the writes are taking place. I'm thinking that ESXi doesn't like the datastore reads being blocked for a number of seconds, and this is what causes the datastores to drop and the hanging to start. I'm now on snv_104 on the Solaris side, by the way.
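
For anyone wanting to reproduce that burst behaviour, the local write test is something along these lines, run on the Solaris box itself (the target path is illustrative):

dd if=/dev/zero of=/zp/localwrite.tmp bs=1048576 count=20480 &
zpool iostat zp 1    # watch writes sit at zero, then flush in bursts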

Now for the awful, awful 'solution': a script which runs at runlevel 3 on the Solaris machine:

while true
do
    sync
    sleep 5
done
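
To get it going at runlevel 3, one option is an old-style rc script along these lines (the file name is arbitrary; the trailing & keeps it from blocking boot):

#!/bin/sh
# /etc/rc3.d/S99syncloop -- flush ZFS writes every few seconds
while true
do
    sync
    sleep 5
done &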

Writes are flushed to disk more frequently, therefore the bursts are shorter, and the whole setup is more stable. It's far from ideal, however, so I'm waiting on the release of Sun xVM Server so I can switch over to that. I'll be able to house the storage (and the 32GB SSD cache) in the same machine instead of having to use iSCSI. When ZFS matures with de-dupe features, I'll be able to take advantage of those too, all as part of the xVM Server solution. It's a shame; I'd love to stay with VMware products at home as we use ESX at my workplace and I like to play around and learn lessons at home rather than risk it at work!

Chris

cjmurray
Contributor

Ah, I see from your post that you're using a NetApp box; I thought from your last post that you were using Solaris like I am. It seems you're observing similar symptoms, but with different hardware. Perhaps it is a bug on the ESX side? Since I don't run server-class hardware, I don't have much hope of it being solved for my own setup.

Given how intermittent mine was, my only suggestion is to induce heavy writes on the NetApp side, and see if ESX still reacts the same way. Maybe that will give some clue as to where the problem is?
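
For example, from an SSH/console session on the ESX host you could hammer a NetApp-backed datastore with something like this (the datastore name is a placeholder):

dd if=/dev/zero of=/vmfs/volumes/<netapp-datastore>/writetest bs=1048576 count=20480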

I'll be watching your thread for ideas too!

Chris

stephansamit
Contributor

Hi, this is Amit.

I'm facing the same kind of problem.

My setup is an ESX 4.1 host, and I'm running 8 VMs with Red Hat Enterprise Linux 6 on them.

I'm using an iSCSI setup for the datastores.

When I run an IOZONE test for I/O benchmarking, some of the VMs stop responding after a while, and I'm getting errors in vmware.log:

Jan 24 05:22:47.278: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xf7
Jan 24 05:22:47.278: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xd2
Jan 24 05:22:47.279: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xf1
Jan 24 05:22:47.279: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xf3
Jan 24 05:22:47.280: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xcb
Jan 24 05:22:47.280: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xfc
Jan 24 05:22:47.280: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xcf
Jan 24 05:22:47.281: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xe1
Jan 24 05:28:31.268: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xbf
Jan 24 05:28:31.269: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xd4
Jan 24 05:28:31.269: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xeb
Jan 24 05:32:09.277: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xe4
Jan 24 05:32:09.277: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xc9
Jan 24 05:32:09.277: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xc6
Jan 24 05:32:09.278: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xc4
Jan 24 05:32:09.278: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xc2
Jan 24 05:32:09.279: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xe2
Jan 24 05:32:09.279: vcpu-0| PVSCSI: scsi0:0: aborting cmd 0xc7

Why is this happening? Is it because of saturation on the SAN network, some limitation of the ESX server, or of the Red Hat OS?
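
For reference, the iozone run is something along these lines; the record size, file size, and path here are just an example, not my exact parameters:

iozone -i 0 -i 1 -r 64k -s 4g -f /mnt/test/iozone.tmp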

computerguy7
Enthusiast

I am getting the same errors:

2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfd
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xec
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf1
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xde
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe8
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd9
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd2
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd7
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd8
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf3
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf7
2014-05-07T21:58:35.568Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf2

Hopefully VMware will tell me why when they call me tomorrow.
