mrcpld
Contributor

Logs "Long VMFS3 rsv time on..."

Hello,
I've been running into the following problem for a few weeks now on my infrastructure: VMware vSphere 4.1 on HP BL460 G1 blades with an HP EVA 6400 storage array.

Description:

The symptom in the VMs is slow disk access, along with an overall drop in performance across the infrastructure's datastores.

See below for more log details (/var/log/vmkernel):


Jan 27 19:25:39 esx500 vmkernel: 51:07:32:13.000 cpu4:4110)FS3: 8496: Long VMFS3 rsv time on 'GA3-OS-02' (held for 210 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

Jan 27 19:25:40 esx500 vmkernel: 51:07:32:14.268 cpu3:4112)FS3: 8496: Long VMFS3 rsv time on 'GA3-DATA-02-FATA' (held for 403 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

Jan 27 19:26:26 esx500 vmkernel: 51:07:32:59.594 cpu5:4111)FS3: 8496: Long VMFS3 rsv time on 'GA3-ISO-Template' (held for 436 msecs). # R: 1, # W: 1 bytesXfer: 3 sectors

Jan 27 19:27:00 esx500 vmkernel: 51:07:33:33.531 cpu7:4169)FS3: 8496: Long VMFS3 rsv time on 'GA3-ISO-Template' (held for 201 msecs). # R: 1, # W: 1 bytesXfer: 3 sectors

Jan 27 19:30:50 esx500 vmkernel: 51:07:37:23.675 cpu2:4169)FS3: 8496: Long VMFS3 rsv time on 'GA3-DATA-02-FATA' (held for 732 msecs). # R: 1, # W: 1 bytesXfer: 3 sectors

Jan 27 19:41:32 esx500 vmkernel: 51:07:48:05.655 cpu5:4111)FS3: 8496: Long VMFS3 rsv time on 'GA3-OS-03' (held for 301 msecs). # R: 1, # W: 1 bytesXfer: 5 sectors
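To see how widespread this is, I summarise the messages per datastore with a quick one-liner (a rough sketch against /var/log/vmkernel, assuming the standard ESX service console tools):

    grep "Long VMFS3 rsv time" /var/log/vmkernel | sed "s/.*on '\([^']*\)'.*/\1/" | sort | uniq -c | sort -rn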

Any ideas?

I’m stumped…

Many thanks for any tips!

DSTAVERT
Immortal

Have a look at the following: http://kb.vmware.com/kb/1002598

May be related to delayed ACK.

-- David -- VMware Communities Moderator
mrcpld
Contributor

The storage is a Fibre Channel HP EVA 6400; I'm not using the iSCSI protocol.
The servers are connected with QLogic 4 Gb FC cards.

That KB refers to a problem with iSCSI storage.

thanks

DSTAVERT
Immortal

I realize that, but the same may apply. I would use your VMware support contract to get some direct help with this.

-- David -- VMware Communities Moderator
mrcpld
Contributor

Thanks anyway, I will follow your advice.
I'd like to know if anyone with a similar environment has run into the same problem.

Anyone??

Thanks

AGratefulDad
Enthusiast

Yes - very similar situation here, only on Dell hardware.

We have submitted logs to VMware with an SR - they have been NO help whatsoever - it makes me wonder why I keep paying for my subscription renewal.

Still have an open SR at this time, but no resolution.

I have to cycle our nodes in order for our disk I/O to come back down to more manageable levels - I am leaning more and more towards a "locking" issue (too much I/O, not enough availability).

Twitter: @thevirtualguy
SurfControl
Enthusiast

We're having the same issue. It happens once every 6 to 7 weeks; we have ESXi 4.1 and an EMC CX4 FC array.

We also see the following errors logged just before the "Long VMFS3 rsv time on..." message:

cpu0:4114)<6>qla2xxx 0000:07:00.0: scsi(5:0:21): Abort command issued -- 1 cdeb8c8 2002.

NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f3a6d40) to NMP device "naa.600601602cxxxxxxxxxxxxxx" failed on physical path "vmhba2:C0:T0:L21" H:0x8 D:0x0 P:0x0 Possible sense data:0x0 0x0 0x0

mrcpld
Contributor

I opened a case with VMware and it appears to be a problem on the storage side. We have an HP EVA 6400.

Support says:

Unfortunately this is a SAN-level issue that will require HP assistance to resolve. I must also note that the SAN not managing or releasing the storage locks from a performance perspective might indicate a more fundamental issue on the storage that requires HP's attention.

Jan 27 16:33:22 esx01 vmkernel: 16:23:57:54.674 cpu4:4100)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41027f9aad40) to NMP device "naa.6001438005debb0e0000800002810000" failed on physical path "vmhba2:C0:T1:L8" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0

Jan 27 16:33:22 esx01 vmkernel: 16:23:57:54.674 cpu4:4100)ScsiDeviceIO: 1672: Command 0x28 to device "naa.6001438005debb0e0000800002810000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.

Jan 27 16:33:22 esx01 vmkernel: 16:23:57:54.674 cpu4:4100)WARNING: ScsiCore: 1353: Power-on Reset occurred on naa.6001438005debb0e0000800002810000

Please contact  your storage vendor for further support on this Path Policy selection and recommendations.

    Device Display Name: HP Fibre Channel Disk (naa.6001438005debb0e0000800001eb0000)

    Storage Array Type: VMW_SATP_ALUA

    Storage Array Type Device Config: {implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=2,TPG_state=ANO}{TPG_id=1,TPG_state=AO}}

    Path Selection Policy: VMW_PSP_RR

    Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=1: NumIOsPending=0,numBytesPending=0}

    Working Paths: vmhba1:C0:T1:L4, vmhba2:C0:T1:L4, vmhba1:C0:T3:L4, vmhba2:C0:T3:L4

As per the logs provided, the ESX host is a client of this storage device, and the ESX configuration appears to be seeing the LUNs correctly. If the path policy or zoning is incorrect, or the storage controller firmware is not compatible or configured correctly, this will result in poor communication with the ESX QLogic controller. I am confident that, following involvement of the SAN vendor, you will be able to address this storage performance issue more effectively.
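(For reference: the per-device configuration above looks like the output of the command below on 4.1; the exact syntax is from memory, so verify it on your build.)

    esxcli nmp device list -d naa.6001438005debb0e0000800001eb0000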

And now I've opened a case with HP support...

mcollins31
Contributor

Was HP Support able to help you with this issue? We are experiencing something similar and getting the "Long VMFS3 rsv time" errors on our IBM DS4700 SAN.

We just updated our SAN to the latest stable firmware/software from IBM, and it still seems to be occurring.

mrcpld
Contributor

HP said that the SAN is OK, but the problem seems to be ongoing. I'm waiting to upgrade to vSphere 4.1 Update 1 and to update the SAN switch firmware. What version do you have, ESXi U1 or full ESX?
Bye

mgossa
Contributor

Hi,

I'm getting the same error.

Long VMFS3 rsv time on 'volume_name' (held for 616 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

We also get:

nmp_CompleteCommandForPath: Command 0x2a (0x41027f3e8340) to NMP device "naa.6000eb39e12f3a6f000000000000003b" failed on physical path "vmhba33:C3:T23:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x9 0x4 0x2.

We're using HP DL380 G5-G7 hosts running ESX 4.1 Update 1 (build 348481) and HP's Virtual SAN Appliance (VSA), also with the latest patches. We also have P4000 LHS nodes and don't have the problem on those.

We've got support cases open with VMware and HP, but so far neither has found a solution. We also have issues with vRanger removing snapshots after backups when the VM is on the VSA storage. I'm sure this is linked. Do you have a similar issue with taking snapshots? And do you get the second error above as well?

Mark

mcollins31
Contributor

Sorry, I hadn't had a chance to reply to your question.  We are running ESX 4.1 build 260247 as well as vSphere 4.1 build 258902.

It looks like we may have (and I stress may have) found the reason why we are getting these error messages within ESX. Our SAN was unbalanced in the way our LUNs were distributed between the SAN controllers (pretty much a 90/10 split), and a good portion of our VDI LUNs were on the heavily used controller. After running performance monitors on the SAN for 48 hours, we changed the preferred paths for a lot of our heavy-I/O LUNs to balance them out to roughly 50/50 across both controllers. Since we did that about 3 days ago, we have not seen those "Long VMFS3 rsv time" errors, which we had been getting each and every day, even on weekends...
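In case anyone else wants to check their own spread, we eyeballed the per-LUN policy and active paths with something along these lines (4.1 esxcli syntax as I remember it, so double-check the field names on your build):

    esxcli nmp device list | egrep "naa\.|Path Selection Policy:|Working Paths:"

It becomes fairly obvious when most LUNs are running down paths to the same controller.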

I will post back later this week and let you guys know if this resolves it for my situation.

uklvirtual
Contributor

Hi Guys,

did you manage to find a solution for the problem?

We are also experiencing the same with our two CLARiiON CX4 systems. The problems seem to have started since we increased the number of big datastores (more than 1.5 TB) in our infrastructure.

Maybe this is related?

Any help is greatly appreciated!!

Regards,

daniel

peetz
Leadership

This message just indicates that a host has held a lock on a VMFS datastore for an unusually long time.

It is not necessarily an error message, but it is a clear sign of one or more of the following:

a) too many hosts are sharing the same datastores/LUNs (the more hosts that access a LUN, the higher the potential for locking conflicts)

and/or

b) too many VMs reside on the same datastore (again increasing locking conflicts)

and/or

c) the I/O performance of the datastore/LUN has dropped below a critical threshold, so that metadata updates on the VMFS filesystem take longer than usual. Some of these metadata updates need locks, so those locks are held longer than usual.

For the third cause you will typically see these messages at times of high I/O load (e.g. when running backup jobs or virus scans in multiple VMs).

So you will need to analyze the load on your LUNs and probably rethink your host-to-datastore and VM-to-datastore ratios.
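A practical way to do that from the service console is esxtop (a sketch; the latency figures below are rules of thumb, not hard limits):

    esxtop    # press 'u' for the per-device disk view
    # DAVG/cmd = latency coming from the array, KAVG/cmd = time spent in the VMkernel
    # sustained DAVG/cmd above roughly 20-30 ms at the times the rsv messages
    # appear points at the array rather than the host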

In your particular situation you should check whether you are already using VAAI with your CLARiiONs. VAAI offloads lock handling to the storage array and thus helps a lot with these issues. AFAIK VAAI is available with EMC's FLARE code 30; if you are not at that level yet, it is definitely worth upgrading.
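To check whether the host-side part of this (hardware-assisted locking, ATS) is switched on under 4.1, something like the following should work from the service console (1 = enabled; the array has to support it as well):

    esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking

On 4.1 the vSphere Client also shows a "Hardware Acceleration" column per datastore.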

Andreas

- Check out my VMware Front Experience Blog

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de