I have created an NFS export on our NetApp which I can see in vCenter without a problem; it is mounted on both of my ESXi hosts.
I can also create a new VM on this NFS datastore without any problem.
But I cannot upload anything to this datastore. If I browse it I can see all the folders and files in it, but every time I try to upload an ISO I get an "I/O error occurred" message and the upload fails.
Well, sometimes it works, but it takes forever: up to 30 minutes to upload a 120 MB file (if I don't get an error first).
I can upload to ESXi local storage without a problem, about 15 seconds for a 120 MB file, but I cannot move the file from local storage to the NFS datastore: same error.
Any ideas?
BTW, my NFS storage is on a completely separate storage network and uses jumbo frames, all correctly configured on the NetApp, the Cisco switch, and ESXi.
I'm also using flow control, set to "send on" on the NetApp and ESXi, and "desired" on the Cisco.
Does anybody have any clues?
OK, so I've figured out what it was: everything started to work as soon as I disabled jumbo frames on the ESXi side (I recreated the vmkernel port group) and on the NetApp side.
So now the question is: why are jumbo frames killing the traffic?
Both of my stacked 2960S switches are configured for a jumbo MTU of 9198 globally, which applies to every interface (system mtu jumbo 9198); the NetApp vif was set to 9000, as was the ESXi port group, but somehow it didn't work.
How can I find out on which end the problem resides? Why are jumbos killing my traffic?
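For anyone trying to narrow this down: the first step is usually to confirm what MTU each device actually reports, not just what was configured. A quick verification sketch (interface and vif names here are examples, not taken from your setup):

```shell
# Cisco 2960S: the jumbo MTU is global and only takes effect after a reload
show system mtu

# NetApp (Data ONTAP 7-Mode): check the vif's effective MTU
ifconfig vif1

# ESXi: list vmkernel NICs and vSwitches with their MTU columns
esxcfg-vmknic -l
esxcfg-vswitch -l
```

If any one of the three reports 1500 while the others report 9000, large frames will be dropped somewhere along the path, which matches the symptoms described here.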
I too am getting this: NetApp NFS storage, ESXi, I/O errors when trying to upload a large file.
I also have problems with snapshots, specifically deleting snapshots.
It's not always a problem, but following the NetApp best practices does not reveal any 'magic button' that I haven't pressed.
I've installed the NetApp 'Virtual Storage Console' plugin into my vCenter and had it apply the 'recommended settings' to both of my ESXi servers, but this hasn't made a difference.
I will try disabling jumbo frames everywhere and see if this alleviates my problem too.
Very interested in seeing where this thread goes: the problem is proving rather frustrating, specifically on the snapshot front, as it's impairing our backup routine!
--
Mark Lomas
Quick update: it seems I never even had jumbo frames enabled! The problem still remains, though.
I'm having the same issue. Jumbo frames are enabled all around (9000 on ESX and on the NetApps; 9198 on the 3750X switches). I can ping -s 9000 to the VLAN vif, but I cannot vmkping -s 9000 to the VLAN vif from my host. Any ideas out there?
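One thing worth checking before concluding the jumbo path is broken: ping -s 9000 succeeding while vmkping -s 9000 fails can simply be a fragmentation difference, since a 9000-byte ICMP payload plus 28 bytes of IP/ICMP headers exceeds a 9000 MTU. The unambiguous test uses an 8972-byte payload with fragmentation disabled (the target IP below is a placeholder for your filer's storage vif):

```shell
# 8972 = 9000 MTU - 20 (IP header) - 8 (ICMP header)
# -d sets "don't fragment", so the frame must traverse the path whole
vmkping -d -s 8972 192.168.50.10
```

If this fails while vmkping -d -s 1472 succeeds, some hop in between is still at a 1500 MTU.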
I spent 4 hours on the line with VMware and Cisco support to fix it, and this is what we found out:
I had 4 NICs in each ESX host: 2 dedicated to storage (vmkernels at 9000 MTU) and 2 for my normal data network (1500 MTU, with one vmkernel port group for console access).
After a ton of testing and troubleshooting, we found that ESX has to have ALL VMKERNELS (no matter which vSwitch they are on) set to the same MTU.
Since I could not put my entire data network on jumbo frames, I had to put the storage network on a 1500 MTU; after that, everything worked great.
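For reference, changing a vmkernel port's MTU on ESX/ESXi 4.x means deleting and recreating it from the CLI, since the MTU could not be changed in place in the vSphere Client at that time. A sketch of the steps (the port group name, IP, and vSwitch name are examples only):

```shell
# Remove the existing storage vmkernel port
esxcfg-vmknic -d StorageKernel

# Recreate it at the desired MTU (1500 here, per the fix above)
esxcfg-vmknic -a -i 10.0.0.11 -n 255.255.255.0 -m 1500 StorageKernel

# Keep the vSwitch MTU consistent with its vmkernel ports
esxcfg-vswitch -m 1500 vSwitch1
```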
OK, thanks.
I do have the option to put everything on MTU 9000, so I will configure it, test, and update the thread.
Remember that ALL endpoints have to be on jumbo frames; that means all desktop and laptop clients too, if you put your data network on 9000.
Just realized something: your Service Console is vswif0, not vmknic(x); the only other vmknic would be for vMotion operations.
I have all my vmknics set to 9000 MTU, as well as all my vSwitches set to 9000 MTU (even those carrying the SC vswif0 and VM traffic, neither of which is a vmknic), and still no luck... Luckily I'm still in the build phase, so I can do this without whacking everything out, but either way, it still didn't work... Looks like I'll be placing my own call to VMware...
Hi, I had the same problem when jumbo frames were enabled on the storage and the ESX hosts, but not in the main Cisco config; I believe it was the system MTU that should have been set to 9000. After setting this correctly the problem was solved. I had the same errors (I/O Error) as you have.
If you found this or any other answer useful please consider the use of the Helpful or correct buttons to award points
I don't have jumbo frames enabled here, neither on the switch, nor the ESXi hosts, nor the NetApp, but I'm still getting the I/O error.
Do you see any warnings or errors in the /var/log/vmkernel log? Or in vmkwarning?
And in the NetApp log, /etc/messages?
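To make that concrete, here is one way to pull the relevant entries from both sides (paths differ between ESX classic, which has separate vmkernel/vmkwarning files, and ESXi, which folds vmkernel output into the messages log):

```shell
# ESX classic: search the dedicated vmkernel logs
grep -iE 'nfs|i/o error' /var/log/vmkernel /var/log/vmkwarning

# ESXi: vmkernel messages land in the combined log
grep -i 'vmkernel' /var/log/messages | grep -i error

# NetApp console (Data ONTAP 7-Mode): dump the filer's message log
rdfile /etc/messages
```

Reproducing the failed upload while tailing the logs usually makes the matching entries easy to spot by timestamp.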
Yes, I'm on ESXi, so it's in the 'messages' log. Here's the only relevant line I see:
Jun 30 13:20:55 vmkernel: 6:18:26:03.136 cpu1:4196)BC: 3582: Failed to flush 128 buffers of size 8192 each for object '' b00f 36 0 40 f87c2c 20 59c0e600 5c98cf 1126c44 40 100f87c2c c231fbca00000001 4100 c231fa0800000000: I/O error
I don't see anything out of the ordinary in the NetApp messages log, but I am not sure whether the NFS logging level could be raised.
--
Mark Lomas
In addition to my update on the error in the ESXi messages log, I have now found something of interest on the NetApp filer: when the problem occurs, the 'badcalls' and 'xdrcall' stats for NFS both increment by 1.
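For others chasing the same counters: on Data ONTAP the nfsstat command reports these fields, and you can zero the counters to correlate increments with a specific failed operation. A minimal workflow sketch, run on the filer console:

```shell
# Show cumulative NFS statistics, including badcalls and xdrcall
nfsstat

# Zero the counters, reproduce the failing upload, then re-check:
# any nonzero badcalls/xdrcall afterwards came from that operation
nfsstat -z
nfsstat
```

Incrementing xdrcall generally indicates the filer could not decode an incoming RPC, which points at a corrupted or truncated request on the wire rather than an NFS permissions or export problem.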
One more suggestion:
Please unmount this NFS mount from the VI Client, then mount it again directly from the ESX host with the esxcfg-nas command, and check if that helps.
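The suggestion above could look like this from the host's CLI (the datastore name, filer address, and export path are examples; substitute your own):

```shell
# List current NAS datastores to get the exact name
esxcfg-nas -l

# Remove the existing mount, then re-add it directly from the host
esxcfg-nas -d NetAppNFS
esxcfg-nas -a -o 10.0.0.50 -s /vol/vmware NetAppNFS

# Verify it mounted cleanly
esxcfg-nas -l
```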
Sent from my iPhone
Then I guess you should speak to NetApp support to see if everything is set up correctly.
Well, as I understand it, badcalls / xdrcall errors usually come down to host/filer communication problems.
Are there any NFS settings that might be relevant on the ESXi side?
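There are a handful of host-level advanced settings that NetApp's Virtual Storage Console typically tunes for NFS datastores; they can be read and set with esxcfg-advcfg. The values below are the commonly recommended ones for ESX/ESXi 4.x at the time, but treat them as an assumption and check the current NetApp/VMware best-practice documents for your release:

```shell
# Read a value back with -g, set it with -s
esxcfg-advcfg -g /NFS/MaxVolumes

esxcfg-advcfg -s 64 /NFS/MaxVolumes          # allow more NFS datastores
esxcfg-advcfg -s 12 /NFS/HeartbeatFrequency  # how often datastore liveness is checked
esxcfg-advcfg -s 10 /NFS/HeartbeatMaxFailures
esxcfg-advcfg -s 30 /Net/TcpipHeapSize       # networking heap, needed for many mounts
```

A host reboot is generally required for the TCP/IP heap changes to take effect.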
--
Mark Lomas