Re: NFS storage network 'hangs' - Page 2

mrnick1234567 · ‎12-29-2011

Hey,

I have a very frustrating problem with Linux VMs' storage 'hanging'. We store our VMs on an EMC2 Isilon cluster accessed over NFS. The machines frequently freeze for 4 to 5 seconds. This affects all Linux VMs (CentOS 5.5) on a particular ESXi host at the same time.

It seems to be related to the virtual disk controller: all VMs with the LSI controller exhibit the issue, but a VM with the IDE one doesn't.

So far I've boiled it down to a very simple setup to recreate the issue:

1) Install ESXi 4.1U1 on a server or workstation connected to the network with a single 1gig link.
2) Set up a VMkernel port for storage and management traffic.
3) Set up a datastore on the Isilon cluster mounted over NFS.
4) Create 2 CentOS 5.5 Linux VMs with the LSI SCSI controller (they don't need network). Boot them into runlevel 1 (i.e. no network, minimum services).
5) On one VM, run ioping to measure latency to its virtual disk. E.g.:
ioping -c 1000 /tmp

On the other VM, write some data to its own virtual disk. E.g.:
dd if=/dev/zero of=/tmp/test bs=1024 count=40000

Most often when you run the dd command, a few seconds later both VMs will hang for 4-5 seconds. When it returns from hanging, ioping always reports a ping time of 4000-5000ms. Both machines are frozen during this period, but the network is OK: I can still ping the ESXi host over the same link.
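
If ioping isn't available in a guest, the stall can be roughly approximated with a small timing loop. This is only a sketch and not a replacement for ioping: it assumes Python 3 is installed in the VM (the stock CentOS 5.5 Python is older), it times small synchronous writes rather than reads, and the names `probe_write_latency` and `stalls`, the 4096-byte block size and the 1000 ms threshold are all illustrative choices.

```python
import os
import tempfile
import time


def probe_write_latency(directory, count=10, block_size=4096, interval=0.0):
    """Time `count` small fsync'd writes in `directory`; return latencies in ms."""
    latencies_ms = []
    payload = b"\0" * block_size
    # The temporary file lives on the filesystem under test and is removed afterwards.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, payload)
            os.fsync(fd)  # push the write through to the (virtual) disk
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
            time.sleep(interval)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies_ms


def stalls(latencies_ms, threshold_ms=1000.0):
    """Return (index, latency_ms) pairs at or above the threshold."""
    return [(i, ms) for i, ms in enumerate(latencies_ms) if ms >= threshold_ms]


if __name__ == "__main__":
    # One probe per second, mirroring "ioping -c 1000 /tmp".
    samples = probe_write_latency("/tmp", count=1000, interval=1.0)
    for index, ms in stalls(samples):
        print(f"request={index} time={ms:.1f} ms")
```

Run it in one VM while the dd command runs in the other; any request taking a second or more is printed.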

As I said, using the IDE controller seems to be a workaround, but it's not ideal. Interestingly, I don't see the issue when using local disks as the datastore, so it seems to be related to NFS-mounted datastores too.

I've tried updating with the latest patches using vCenter Update Manager.

Any ideas?

Nick

J1mbo · ‎03-01-2012

Out of interest, what queue depth is needed to produce the greatly extended latency?

There are some advanced NFS settings in v5 that might be worth a look: (NFS.) MaxConnPerIP, MaxQueueDepth (defaults to 4G), and the lo water mark too, though that is already quite aggressive at 25%. Check storage LAN latency; the default send buffer is 256KB.
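
On the latency point: with a fixed send buffer, a single connection can have at most one buffer's worth of data in flight per round trip, so throughput is capped at roughly buffer size divided by round-trip time. A back-of-envelope sketch, assuming the 256KB figure above; the function name and the round-trip times are hypothetical, not measurements from any of the setups in this thread.

```python
def max_throughput_mbit(buffer_bytes, rtt_seconds):
    """Upper bound on single-connection throughput when limited by the send buffer."""
    return buffer_bytes * 8 / rtt_seconds / 1e6


SEND_BUFFER = 256 * 1024  # 256KB, the default quoted above

for rtt_ms in (0.2, 1.0, 2.0, 5.0):  # hypothetical storage LAN round-trip times
    ceiling = max_throughput_mbit(SEND_BUFFER, rtt_ms / 1000.0)
    print(f"RTT {rtt_ms:4.1f} ms -> at most {ceiling:8.1f} Mbit/s")
```

By this estimate the buffer only becomes the limit on a 1gig link once round-trip time climbs past roughly 2 ms, which is why the LAN latency is worth checking before touching the buffer sizes.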

mrnick1234567 · ‎03-01-2012

Hi Varjen,

I'm not exactly sure your problem is the same as the one I started this thread about.

On your CentOS VM, install ioping and run

ioping -c 1000 /tmp

or point it at some other directory on the VM's virtual disk.

If it's the same issue, you should see ioping hang for 4-5 seconds every once in a while. I see it every minute or so.

I've tested with ESXi5 and don't have the issue; I only get it with 4.1, which makes me think your problem might be something different.

Incidentally I got on to VMware Support a month or so ago, and they haven't had any ideas either about this.

Nick

Varjen · ‎03-01-2012

I think it might be similar anyway.

The problem is not restricted to file transfers but also occurs when I try to format disks or do other IO-heavy tasks.
But yes, I see now that there might be a difference between the problems.
My problems start about 10-15 seconds into heavy IO and continue as long as the task is running...

Yours seems to be more recurring?

I have been in touch with support too and got nothing from them.

Sorry if I hijacked your thread!

=T=

mrnick1234567 · ‎03-01-2012

That's OK. It would be good for you to try the ioping test.

If you don't get the issue then it is probably a different latency issue, and it would be good if you could start a new thread about it.

This thread is about a specific problem, the storage network 'hanging', rather than high latency in general.

Cheers

Nick

Varjen · ‎03-01-2012

Jimbo, I have to say I think I might love you now.

I maxed out NFS.maxconn and the send/receive buffers and it works like a charm. I did get a little higher latency than baseline, but it's totally acceptable.

=T=

Varjen · ‎03-01-2012

Hello again.

I reread your thread and a couple of thoughts struck me...

Have you tried the other SCSI controllers for your VMs?

Have you looked at the real size of your virtual disks? VM images stored on NFS datastores are always in thin mode as far as I can tell, and that's not optimal. I don't know if any of these thoughts help...

mrnick1234567 · ‎03-01-2012

Thanks for the ideas. Yes we have tried the different controllers and no luck.

I don't think the size of the disks is the problem here, as we don't see the issue with Windows guests over NFS.

Cheers

Nick

cncs · ‎03-26-2012

I have been seeing the exact same issue since sometime last year, on the same version of ESXi.

My box is VMware ESXi 4.1 running a Nexenta VM exporting ZFS back to VMware as NFS, running about 8 lightly loaded guest VMs.

I mainly noticed the latency issues due to the high load spikes (considering the low utilization) that we were graphing in Munin.

So I installed ioping on a couple of the Linux VMs and got the same odd latency spikes you were describing:

4096 bytes from /dev/sda (device 120.0 Gb): request=36889 time=1.1 ms
4096 bytes from /dev/sda (device 120.0 Gb): request=36890 time=1.2 ms
4096 bytes from /dev/sda (device 120.0 Gb): request=36891 time=4083.4 ms
4096 bytes from /dev/sda (device 120.0 Gb): request=36892 time=1.0 ms
4096 bytes from /dev/sda (device 120.0 Gb): request=36893 time=1.0 ms
4096 bytes from /dev/sda (device 120.0 Gb): request=36894 time=0.7 ms
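
Picking the spikes out of a long run by eye gets tedious, so a small filter can help. A minimal sketch, assuming the exact line format shown above with times in ms (other ioping versions may print different units, which this does not handle); `find_spikes` and the 1000 ms threshold are illustrative names and values.

```python
import re
import sys

# Matches lines in the format shown above, e.g.
# "4096 bytes from /dev/sda (device 120.0 Gb): request=36891 time=4083.4 ms"
LINE_RE = re.compile(r"request=(\d+)\s+time=([\d.]+)\s*ms")


def find_spikes(lines, threshold_ms=1000.0):
    """Return (request, time_ms) for every line at or above threshold_ms."""
    spikes = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            request, time_ms = int(match.group(1)), float(match.group(2))
            if time_ms >= threshold_ms:
                spikes.append((request, time_ms))
    return spikes


if __name__ == "__main__":
    # Reads ioping output from standard input.
    for request, time_ms in find_spikes(sys.stdin):
        print(f"request={request} time={time_ms} ms")
```

Saved under a hypothetical name such as find_spikes.py, it can be fed from a pipe: ioping -c 1000 /tmp | python3 find_spikes.py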

The VMs:

Debian 6.0, kernel 2.6.32 with the "LSI Logic Parallel" adapter

CentOS 5.6, kernel 2.6.18 with the "LSI Logic Parallel" adapter

Mandriva 2010.2, kernel 2.6.33 with the "Paravirtual" adapter

I will probably upgrade to ESXi 5.0 eventually anyway, but will try out the NFS tuning you mentioned earlier in the thread first.

Kim

mrnick1234567 · ‎03-29-2012

Hi Kim,

It's funny you mention that, we just got a Boston Supermicro box running Nexenta to eval. It has the same issue you describe and that I see with our Isilon storage. I'm still talking to VMware support about it, but it's not been very quick getting anywhere. The latest suggestion was to upgrade the NIC drivers, as there were some later versions for the Broadcom NICs in our R610s, but this doesn't seem to have made any difference.

Let me know if you have any breakthroughs!

Nick

cncs · ‎03-29-2012

Hi Nick,

I upgraded to ESXi 5.0 after getting nowhere with the NFS optimizations, and the upgrade got rid of the issue. I/O latency over NFS storage is much lower now across all the VMs. I don't get the 1-2ms latencies any longer, so I will need to work on tuning that again, but overall I am much happier with the performance.

But the upgrade to ESXi5 caught me out on device passthrough: I needed to pass through a PCI device, which worked on ESXi4.1 but on 5 causes the ESXi host to freeze on boot-up. I was going to try out the newly released update for ESXi5 and cross my lucky stars, but then saw the forum posts about the bug preventing automatic machine startup on the ESXi host when running the free version 😞

I had to disable passthrough for that particular device, which affected mainly one VM guest, but as I mentioned before overall I am happy enough that I will move that particular VM guest to bare metal.

Regards,
Kim

jmerchan · ‎07-03-2012

Is there any hotfix to solve this problem? We don't want to move to vSphere 5.

thanks

mrnick1234567 · ‎07-03-2012

I don't think there is. I had this issue logged with VMware support for months, and eventually they came back to say they would have to close the case, leaving it unresolved. Their only suggested fix was vSphere 5. We saw the issue on Isilon and Nexenta storage products, and both vendors were aware of it.
