ESXi 6.5U1 got random Lost access to volume

giobuon · ‎10-10-2017

Hello,

My setup has a few ESXi nodes installed on HP DL380 G9, which have Intel 82599 10 Gigabit Dual Port for iSCSI to connect to a 3PAR 8200.

The Intel driver is lastest: ixgbe version 4.5.1

The 3PAR firmware is: 3.2.2 (MU4)

Each vmk adapter has their own DSwitch, Portgroup and physical NIC.

The physical NIC connect to dedicated 10Gbps switch, which have Jumbo frame enable.

DelayACK is disabled.

Recently I got many random event logs about:

Device naa.xxxxx performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microsecond

Lost access to volume xxxxx (xxxx) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

After few moment everything back to normal:

Successfully restored access to volume xxxxx (xxxx) following connectivity issues.

Device naa.xxxxx performance has improved. I/O latency reduced from 19403 microseconds to 1201 microseconds.

Checking the performance of physical NIC and storage found nothing abnormal. The throughput is actually low (about 100Mbps)

Could you please share your idea?

Thank you.

Finikiez · ‎10-10-2017

First of all check vmkernel.log for related messages to the time when you had an issue.

There should be some SCSI sence codes.

hussainbte · ‎10-10-2017

I dont find 3PAR 8200 in VMware HCL.. what is the version of ESXi?

If you found my answers useful please consider marking them as Correct OR Helpful Regards, Hussain https://virtualcubes.wordpress.com/

giobuon · ‎10-10-2017

I found nothing much on logs (I'm using Log Insight).

The vmwkernel.log has few lines at that time

2017-10-10T19:01:28.193Z cpu2:65603)WARNING: ScsiDeviceIO: 1188: Device naa. performance has deteriorated. I/O latency increased from average value of 517 microseconds to 10833 microseconds.

2017-10-10T19:01:40.573Z cpu0:65601)WARNING: ScsiDeviceIO: 1188: Device naa. performance has deteriorated. I/O latency increased from average value of 517 microseconds to 21811 microseconds.

2017-10-10T19:06:39.434Z cpu1:65602)ScsiDeviceIO: 1162: Device naa. performance has improved. I/O latency reduced from 21811 microseconds to 4255 microseconds.

2017-10-10T19:12:06.467Z cpu1:66639)WARNING: ScsiDeviceIO: 1188: Device naa. performance has deteriorated. I/O latency increased from average value of 518 microseconds to 13703 microseconds.

2017-10-10T19:14:40.404Z cpu30:66639)ScsiDeviceIO: 1162: Device naa. performance has improved. I/O latency reduced from 13703 microseconds to 2706 microseconds.

giobuon · ‎10-10-2017

Thank you for pointing out the HCL, I was thought 3PAR 8000 is the name of series . I'll check again with HP.

The version of ESXi is: 6.5 U1 build 5969303.

Finikiez · ‎10-11-2017

Try grep vmkernel log with naa id.

There should be SCSI errors like H:0x5, H:0x8.

Also check your SAN switches for dropped frames counter.

giobuon · ‎10-11-2017

Yes, there are some SCSI errors:

2017-10-10T09:23:22.784Z cpu1:66639)WARNING: ScsiDeviceIO: 1188: Device naa.id performance has deteriorated. I/O latency increased from average value of 515 microseconds to 13821 microseconds.
2017-10-10T09:23:31.916Z cpu18:66639)ScsiDeviceIO: 1162: Device naa.id performance has improved. I/O latency reduced from 13821 microseconds to 2703 microseconds.
2017-10-10T09:26:12.552Z cpu31:66637)ScsiDeviceIO: 1162: Device naa.id performance has improved. I/O latency reduced from 2703 microseconds to 1012 microseconds.
2017-10-10T09:27:01.343Z cpu17:66637)WARNING: ScsiDeviceIO: 1188: Device naa.id performance has deteriorated. I/O latency increased from average value of 515 microseconds to 10322 microseconds.
2017-10-10T09:27:01.767Z cpu9:66639)WARNING: ScsiDeviceIO: 1188: Device naa.id performance has deteriorated. I/O latency increased from average value of 515 microseconds to 21068 microseconds.
2017-10-10T09:27:07.140Z cpu1:66639)ScsiDeviceIO: 1162: Device naa.id performance has improved. I/O latency reduced from 21068 microseconds to 4137 microseconds.
2017-10-10T09:27:12.624Z cpu0:66639)WARNING: ScsiDeviceIO: 1188: Device naa.id performance has deteriorated. I/O latency increased from average value of 515 microseconds to 10908 microseconds.
2017-10-10T09:27:22.028Z cpu18:66637)ScsiDeviceIO: 1162: Device naa.id performance has improved. I/O latency reduced from 10908 microseconds to 2162 microseconds.
2017-10-10T09:27:40.457Z cpu0:66637)ScsiDeviceIO: 1162: Device naa.id performance has improved. I/O latency reduced from 2162 microseconds to 1013 microseconds.
2017-10-10T09:41:30.337Z cpu28:65629)ScsiDeviceIO: 2948: Cmd(0x439599510380) 0x85, CmdSN 0xec4 from world 68205 to dev "eui.00a0504658335330" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2017-10-10T09:41:30.428Z cpu28:65629)ScsiDeviceIO: 2948: Cmd(0x439599510380) 0x85, CmdSN 0xec5 from world 68205 to dev "eui.00a0504658335331" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2017-10-10T09:41:30.521Z cpu30:65631)ScsiDeviceIO: 2948: Cmd(0x439599510380) 0x85, CmdSN 0xec6 from world 68205 to dev "eui.00a0504658335332" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2017-10-10T09:56:57.569Z cpu22:66637)WARNING: ScsiDeviceIO: 1188: Device naa.id performance has deteriorated. I/O latency increased from average value of 516 microseconds to 12305 microseconds.
2017-10-10T09:57:00.756Z cpu16:66637)ScsiDeviceIO: 1162: Device naa.id performance has improved. I/O latency reduced from 12305 microseconds to 2421 microseconds.
2017-10-10T09:57:42.679Z cpu0:66637)ScsiDeviceIO: 1162: Device naa.id performance has improved. I/O latency reduced from 2421 microseconds to 1014 microseconds.
2017-10-10T11:21:17.467Z cpu20:65621)ScsiDeviceIO: 2962: Cmd(0x439d892d31c0) 0x89, CmdSN 0xa90443 from world 65591 to dev "naa.id" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2017-10-10T11:21:18.465Z cpu20:65621)ScsiDeviceIO: 2962: Cmd(0x439d8939b0c0) 0x89, CmdSN 0xa90445 from world 65591 to dev "naa.id" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2017-10-10T11:21:18.465Z cpu23:65624)ScsiDeviceIO: 2912: Cmd(0x439581568980) 0x8a, CmdSN 0xf5 from world 982026 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65664)NMP: nmp_ThrottleLogForDevice:3617: Cmd 0x8a (0x4395815f1880, 982026) to dev "naa.id" on path "vmhba64:C0:T0:L1" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2017-10-10T11:21:18.465Z cpu14:65664)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.id" state in doubt; requested fast path state update...
2017-10-10T11:21:18.465Z cpu14:65664)ScsiDeviceIO: 2912: Cmd(0x4395815f1880) 0x8a, CmdSN 0x67 from world 982026 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65664)ScsiDeviceIO: 2912: Cmd(0x439d8102d580) 0x8a, CmdSN 0xed from world 982026 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65664)ScsiDeviceIO: 2912: Cmd(0x439d91de2a80) 0x8a, CmdSN 0x85 from world 982026 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65664)ScsiDeviceIO: 2912: Cmd(0x439d8925e140) 0x8a, CmdSN 0xa7 from world 982026 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65615)ScsiDeviceIO: 2912: Cmd(0x43959a767380) 0x8a, CmdSN 0xa1 from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65615)ScsiDeviceIO: 2912: Cmd(0x439591fe8c40) 0x8a, CmdSN 0xe2 from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65615)ScsiDeviceIO: 2912: Cmd(0x4395994a1c00) 0x8a, CmdSN 0xe7 from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65615)ScsiDeviceIO: 2912: Cmd(0x4395815c1680) 0x8a, CmdSN 0xea from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu14:65615)ScsiDeviceIO: 2912: Cmd(0x439599489080) 0x8a, CmdSN 0xf3 from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)NMP: nmp_ThrottleLogForDevice:3546: last error status from device naa.id repeated 10 times
2017-10-10T11:21:18.465Z cpu21:65664)NMP: nmp_ThrottleLogForDevice:3546: last error status from device naa.id repeated 20 times
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d810cd780) 0x8a, CmdSN 0x80000034 from world 361129 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d91ceb780) 0x8a, CmdSN 0x80000059 from world 361129 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d81109d80) 0x8a, CmdSN 0x80000037 from world 361129 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d810faf80) 0x8a, CmdSN 0x8000007f from world 361129 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d91c9ac00) 0x8a, CmdSN 0x8000007c from world 361129 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d91d70780) 0x2a, CmdSN 0xa90441 from world 65591 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d893588c0) 0x2a, CmdSN 0xa90442 from world 65591 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.465Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439581509e00) 0x2a, CmdSN 0xa90444 from world 65591 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x4395996df6c0) 0x2a, CmdSN 0xa90446 from world 65591 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439d8925f2c0) 0x8a, CmdSN 0x80000066 from world 74626 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x43959a6baa00) 0x8a, CmdSN 0x80000008 from world 316498 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439591e76340) 0x8a, CmdSN 0x8000002c from world 316498 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439591fa83c0) 0x8a, CmdSN 0x80000022 from world 316498 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439591f4a2c0) 0x8a, CmdSN 0xc3 from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x4395994ec200) 0x8a, CmdSN 0xa9 from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x4395994ca000) 0x8a, CmdSN 0xd8 from world 70813 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x439581462880) 0x8a, CmdSN 0x800e000f from world 316667 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.466Z cpu21:65664)ScsiDeviceIO: 2912: Cmd(0x43959976e440) 0x8a, CmdSN 0x800e006f from world 316667 to dev "naa.id" failed H:0x8 D:0x0 P:0x0
2017-10-10T11:21:18.932Z cpu0:70477)HBX: 2958: 'VMDS': HB at offset 3936256 - Waiting for timed out HB:
2017-10-10T11:21:18.932Z cpu0:70477) [HB state abcdef02 offset 3936256 gen 67 stampUS 2274494593778 uuid 59b9f987-36d0ee3c-1930-1402ec8fea10 jrnl <FB 38049600> drv 14.81 lockImpl 4 ip ip]
2017-10-10T11:21:22.581Z cpu21:65664)HBX: 283: 'VMDS': HB at offset 3936256 - Reclaimed heartbeat [Timeout]:
2017-10-10T11:21:22.581Z cpu21:65664) [HB state abcdef02 offset 3936256 gen 67 stampUS 2274510709027 uuid 59b9f987-36d0ee3c-1930-1402ec8fea10 jrnl <FB 38049600> drv 14.81 lockImpl 4 ip ip]
2017-10-10T11:21:51.582Z cpu13:66334)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.id repeated 23 times

I'm using iSCSI, and there are no frame drop or error on iSCSI switch.

Finikiez · ‎10-11-2017

Ok

0x8a and 0x2a are SCSI WRITE commands

They failed with H:0x8 status.

H:0x8 means VMK_SCSI_HOST_RESET - This status is returned when the HBA driver aborts I/O. It also occurs if the HBA resets the target.

https://kb.vmware.com/kb/1029039

Do you use HW iSCSI initiator or Software iSCSI initiator?

Do you have the latest drivers installed for your network cards?

VMK_SCSI_HOST_RESET

giobuon · ‎10-11-2017

Thank you for the link.

I use software iSCSI initiator.

The Intel driver is latest: ixgbe version 4.5.1

FreddyFredFred · ‎10-11-2017

Don't have an answer for you but just wanted to say that I'm having similar issues. Running ESXi 6.0U2 with an 8200 on 3.2.2MU4. I get random lost access, regained access a few seconds later which seems to be loss of heartbeat related but there are no latency warnings that precede it though. The sense code is Hx02 I believe. I tried turning off the ATS heartbeat but no luck.

HP also discovered I was getting aborts with the Hx05 and 08. These were only logged in vmkernal.log, never made it to vCenter gui. per the VMware docs its an hbe abort but no clue why.

I'm also not running that much throughput and I have the same nics as you.

3par support does see I have some duplicateacks and aborts and are are saying it outside the array at this point. There's no clues in the VMware log that point to a definitive problem. Right now my only clue are some drops on my switch (it's a nexus and its reporting input drops on the San ports which I believe means the data going from San to server drops because server port won't accept it). I have a case open with Cisco to see if I can get some debug info but my next step is to try and tweak the qos on this switch. Failing that, I'll need to open a case with VMware and/Or Dell on the server side (also have a few 360g9 servers but I see it on the Dell servers mostly, because I have more of them, but it's completely random).

if I get more clues I'll post back

hussainbte · ‎10-11-2017

In some cases the H:0x5 and H:0x8 occurs after some device end check conditions as well which ultimately point to the array.

I don't want to highlight it again but I suggest you get in touch with vendor and make sure if the array supports the version of ESXi.

If this goes to VMware Supports thats the 1st thing they will check.

Another option is to try and run a supported version of ESXi and you might find the hyper-visor and SAN communication being absolutely fine.

If you found my answers useful please consider marking them as Correct OR Helpful Regards, Hussain https://virtualcubes.wordpress.com/

Finikiez · ‎10-12-2017

Ok, does it happen frequently? How many VMs are on the affected datastore?

Are all datastores on 3PAR affected? or just one?

Just want to draw your attention, that this is not a path failure. This is an issue with updating HB information by ESXi host.

BTW 3PAR is definitely supported

VMware Compatibility Guide - Storage/SAN Search

FreddyFredFred · ‎10-12-2017

Hi giobuon,

From your description it sounds like you're doing iSCSI, yes? If so, can you tell me what model switch you have and if you're using 1 switch or 2, 1 vlan or 2, if traffic is crossing the switches or is contained within the switch?

The two big things in common with our setups at the moment is the NIC and the 3par. Your network sounds like it might be configured bit better than mine (I'm running converged san/lan on the same vDS with network i/o control but NOT storage io control) but we're still facing a very similar issue with random drops/aborts.

I'm been fighting with this for a couple months and would really like to find the solution...

giobuon · ‎10-12-2017

Ok, does it happen frequently? How many VMs are on the affected datastore?
Are all datastores on 3PAR affected? or just one?

It's randomly, it reoccur after few hours or sometimes few days. I have about 100VMs on the datastore.

I have only this LUN on 3PAR so far.

BTW 3PAR is definitely supported

Does 3PAR 8000 is represent all 8000 series or only 8000 model?

Thank you.

giobuon · ‎10-12-2017

From your description it sounds like you're doing iSCSI, yes? If so, can you tell me what model switch you have and if you're using 1 switch or 2, 1 vlan or 2, if traffic is crossing the switches or is contained within the switch?

I have two separate 10Gbps Juniper EX4500 switches, so the traffic is isolated on each switch.

Finikiez · ‎10-13-2017

This row in HCL covers all arrays in 8000 series.

100 VMs is a big number for a single datastore especially if they all have thin-provisioned virtual disks. This might be an issue with updating HB and H:0x8 errors. How many ESXi hosts do you have connected to this LUN?

Do you have some free space on your storage to make a second LUN and migrate half of VMs to this new LUN?

giobuon · ‎10-23-2017

I have about ten ESXi hosts connect to the LUN. I can create new LUN but not sure if it can help since two LUNs share the same physical disks.

velocity08 · ‎06-09-2021

Hi giobuon

we've been seeing this as well.

did you ever find a solution to this issue?

We haev other San and NAS in the mix and can confirm this is directly related to the 3Par only.

any assistance or information would be greatly appreciated.

""CHeers

G

All

ESXi 6.5U1 got random Lost access to volume