VMware Cloud Community
lulu62
Enthusiast

Datastore connectivity issues after big VM delete job

Hello,

I'm investigating a strange issue we experienced following the automated deletion of ~110 VMs.

Our infrastructure consists of 30 Dell R740 ESXi 6.7 P01 hosts, one vCenter Server 6.7 U3b, and one Kaminario K2 all-flash storage array. Connectivity is iSCSI with multipathing.

On April 16th, between 1:04pm and 1:06pm, my colleague initiated the removal of ~110 VMs via GitLab/Terraform.

A moment later, I started seeing these errors in the vmkernel and hostd logs of our ESXi hosts:

Lots of "lost access to volume" and "Successfully restored access to volume" for all our LUNs.

2020-04-16T13:07:51.044Z info hostd[2100730] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14158 : Lost access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

2020-04-16T13:07:51.611Z info hostd[2100680] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14159 : Successfully restored access to volume 5e0e34c7-fe079ae4-880d-b02628657b90 (fake_vol_name_001) following connectivity issues.

2020-04-16T13:08:04.051Z info hostd[2258728] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14508 : Lost access to volume 5e846ee7-9eeaf25e-c425-b02628c83b80 (fake_vol_name_002) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

2020-04-16T13:08:04.668Z info hostd[2256191] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 14509 : Successfully restored access to volume 5e846ee7-9eeaf25e-c425-b02628c83b80 (fake_vol_name_002) following connectivity issues.

Lots of failed 0x42 and 0x89 (UNMAP and COMPARE AND WRITE) SCSI commands:

2020-04-16T13:08:43.433Z cpu42:2097285)ScsiDeviceIO: 3449: Cmd(0x459a969307c0) 0x89, CmdSN 0x9e2bfd from world 2097233 to dev "eui.0024f4008148000e" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0

2020-04-16T13:08:47.358Z cpu47:2098510)ScsiDeviceIO: 3399: Cmd(0x45a29262e7c0) 0x42, CmdSN 0x75a7d7 from world 3558716 to dev "eui.0024f4008148000d" failed H:0x8 D:0x0 P:0x0

2020-04-16T13:08:48.694Z cpu63:2098510)NMP: nmp_ThrottleLogForDevice:3802: Cmd 0x42 (0x45a28efa72c0, 3559157) to dev "eui.0024f400814801be" on path "vmhba64:C1:T0:L3" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL

Those errors kept repeating for more than an hour, until 2:27pm.
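In case it's useful to anyone reading the logs: 0x89 is COMPARE AND WRITE (ATS) and 0x42 is UNMAP, and with D:0x0 the array itself isn't reporting an error; if I read the host status codes right, H:0x5 and H:0x8 are host-side aborts/resets. To double-check that a device really advertises the VAAI primitives behind those opcodes, something like the following from the ESXi shell should do it (the eui is just the one from my logs above):

esxcli storage core device vaai status get -d eui.0024f4008148000e

That should list the ATS/Clone/Zero/Delete status for the device; "Delete Status: supported" means the host will issue UNMAP (0x42) against it.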

I would expect the storage array to be able to handle the deletion of ~110 VMs almost all at once.

According to my network colleague, we had no network outage during this timeframe.

I have a support case open with Kaminario, and I just opened one with VMware today.

Any idea what could have happened?

daphnissov
Immortal

Just a guess, but it's possible the resulting "delete" action on the K2 caused an UNMAP storm which wasn't appropriately throttled and choked all the processing away from the ESXi hosts. It could be the result of a microcode issue on the array, a defect or suboptimal config setting within ESXi, or a combination of both. You're doing the right thing by opening support cases with both vendors, however.

IRIX201110141
Champion

I've seen something similar related to SCSI UNMAP. A mass deletion (gigabytes of data and hundreds of thousands of files) inside the guest OS (UNMAP is enabled by default in modern Windows releases) affected the datastore/LUN and left the VM unusable.

Working with VMware GSS and the storage vendor was NOT straightforward, and in the end we disabled SCSI UNMAP from the guest OS side as a workaround. VMware reported that some PRs are pending in that area.
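For Windows guests, that typically means turning off delete notifications so the guest stops sending TRIM/UNMAP down to its virtual disks, something along these lines (check with GSS before rolling it out fleet-wide):

fsutil behavior query DisableDeleteNotify
fsutil behavior set DisableDeleteNotify 1

On Linux guests the equivalent is removing the discard mount option and/or disabling the fstrim timer.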

Regards,
Joerg

lulu62
Enthusiast

Thanks, but I have a bad feeling about the support outcome (that's probably why I created this thread...).

With Kaminario, basically all I've been told so far is to check the datastore space reclamation settings (priority and reclamation rate) and to adjust the UNMAP priority on the storage array.

The UNMAP priority on the storage array is currently at its default, which gives it full priority with unlimited bandwidth, whereas the datastores are also at their defaults, which is low priority and 100 MB/s.
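For reference, the datastore-side values above can be read (and changed) per datastore from the ESXi shell, something like this (volume label taken from my redacted logs):

esxcli storage vmfs reclaim config get --volume-label=fake_vol_name_001
esxcli storage vmfs reclaim config set --volume-label=fake_vol_name_001 --reclaim-priority=none

The get shows the reclaim granularity and priority (plus the bandwidth settings on 6.7), and setting the priority to none would disable automatic space reclamation on that datastore entirely.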

I doubt playing with these settings would prevent the failed SCSI commands and datastore connectivity errors if the same scenario happened again. Moreover, the automated VM deletion job also removed VMs in a second vCenter in another datacenter (same infrastructure/hardware), and over there we didn't get these errors.
