VMware Cloud Community
Grim192
Enthusiast

Disk claim issue

My lab has 3x HP GEN 10 Microservers, each containing 1x SSD and 3x HDDs.

I was previously running vSAN on v6.5 fine until I decided to upgrade to v6.7. The vCenter upgraded fine, but the hosts kept failing, so I decided to just wipe the hosts and the storage and start fresh. Each of the hosts was freshly installed with v6.7, disks wiped, vmkernel ports configured, etc. When I created the disk groups, hosts 2 and 3 completed successfully, but host 1 hung and eventually failed. If I rebooted host 1, the boot would hang until I removed the SSD, after which it booted successfully. I then had to wipe the SSD before the host would boot again. I've tried swapping the disks between the hosts and rebuilding the disk groups, but the problem only affects host 1 and the SSD.

I've found this in the logs, but I'm confused as to why it only affects one host on v6.7:

vmkwarning.log

2018-12-29T23:54:01.484Z cpu1:2097640)WARNING: vmw_ahci[00000100]: IssueCommand:ERROR: Tag 1 SActive already set: SACT:6 CI:7 reissue_flag:0

2018-12-29T23:54:01.484Z cpu0:2097736)WARNING: NMP: nmpCompleteRetryForPath:357: Retry cmd 0x28 (0x459a40cba2c0) to dev "t10.ATA_____V42DCT064V4SSD2__________________________200105984___________" failed on path "vmhba1:C0:T0:L0" H:0x1 D:0x2 P:0x$

2018-12-29T23:54:01.484Z cpu0:2097736)WARNING: NMP: nmpCompleteRetryForPath:387: Logical device "t10.ATA_____V42DCT064V4SSD2__________________________200105984___________": awaiting fast path state update before retrying failed command again...

2018-12-29T23:54:02.486Z cpu2:2097640)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "t10.ATA_____V42DCT064V4SSD2__________________________200105984___________" - issuing command 0x459a40cba2c0

vmkernel.log

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71b9d0:[0x41801a5f1a03]ahciRequestIo@(vmw_ahci)#<None>+0x4ac stack: 0x4306c3654a88, 0x4306c3654990, 0x451a0a71bb68, 0x0, 0x0

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71ba50:[0x41801a5f7f38]scsiExecReadWriteCommand@(vmw_ahci)#<None>+0x51 stack: 0x4306c3657de8, 0x41801a5f816e, 0x4306c36549c8, 0x41801a5f82fa, 0x451a0a71bb58

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71ba70:[0x41801a5f816d]ataIssueCommand@(vmw_ahci)#<None>+0x46 stack: 0x451a0a71bb58, 0x10005893c29a88f, 0x0, 0x459a40d2cb00, 0x4302fb4dbd40

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71ba80:[0x41801a5f82f9]scsiQueueCommand@(vmw_ahci)#<None>+0xd2 stack: 0x0, 0x459a40d2cb00, 0x4302fb4dbd40, 0x0, 0x459a40d2c8c0

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71bac0:[0x418019f78ceb]SCSIIssueCommandDirect@vmkernel#nover+0xf8 stack: 0x15, 0x41801a5f8228, 0x418019f78cd5, 0x451a0a71bf80, 0x451a0a71bb68

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71bb30:[0x418019f7a01f]SCSIStartAdapterCommands@vmkernel#nover+0x384 stack: 0x4302fb4dbec0, 0x40000108, 0x4302fb4dc2b0, 0x4302fb4dc238, 0x451a00000001

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71bbc0:[0x418019f8924b]SCSIStartPathCommands@vmkernel#nover+0x4d8 stack: 0x451a075a3000, 0x418019f01124, 0x0, 0x58900000000, 0x451a0a71bcf0

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71bd70:[0x418019f8fc0f]SCSIIssueAsyncPathCommandDirect@vmkernel#nover+0x240 stack: 0x3436305443443234, 0x5f5f324453533456, 0x5f5f5f5f5f5f5f5f, 0x5f5f5f5f5f5f5f5f, 0x41801a8115c4

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71be30:[0x418019f9153b]vmk_ScsiIssueAsyncPathCommandDirect@vmkernel#nover+0x1c stack: 0x4308b09d3ad0, 0x41801a7d7726, 0x1, 0xd120000000030, 0x451a0a71bec8

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71be50:[0x41801a7d7725]nmp_SelectPathAndIssueCommand@com.vmware.vmkapi#v2_5_0_0+0xea stack: 0x451a0a71bec8, 0x451a0a71be70, 0x0, 0x4308b09d3ad0, 0x459a40d2c440

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71beb0:[0x41801a7d2c81]nmpAttemptFailover@com.vmware.vmkapi#v2_5_0_0+0x102 stack: 0x459a40d2c140, 0x4308b09aa098, 0xffffffff, 0x4308b09d4590, 0x43007e6c5070

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71bf30:[0x418019cea442]HelperQueueFunc@vmkernel#nover+0x30f stack: 0x4308b09b37a8, 0x4308b09b3798, 0x4308b09b37d0, 0x451a0a723000, 0x4308b09b37a8

2018-12-30T00:03:55.487Z cpu0:2097640)0x451a0a71bfe0:[0x418019f09112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0

2018-12-30T00:03:55.487Z cpu0:2097175)WARNING: NMP: nmpCompleteRetryForPath:357: Retry cmd 0x28 (0x459a40d2c140) to dev "t10.ATA_____V42DCT064V4SSD2__________________________200105984___________" failed on path "vmhba1:C0:T0:L0" H:0x1 D:0x2 P:0x$

2018-12-30T00:03:55.487Z cpu0:2097175)WARNING: NMP: nmpCompleteRetryForPath:387: Logical device "t10.ATA_____V42DCT064V4SSD2__________________________200105984___________": awaiting fast path state update before retrying failed command again...

2018-12-30T00:03:56.486Z cpu1:2097640)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "t10.ATA_____V42DCT064V4SSD2__________________________200105984___________" - issuing command 0x459a40d2c140

2018-12-30T00:03:56.486Z cpu1:2097640)WARNING: vmw_ahci[00000100]: IssueCommand:ERROR: Tag 1 SActive already set: SACT:6 CI:7 reissue_flag:0

TheBobkin
Champion

Hello Grim192

"I decided to upgrade to v6.7. The vCenter upgraded fine but the hosts kept failing, so I decided to just wipe the hosts and the storage and start fresh. Each of the hosts were freshly installed with v6.7"

Please confirm the hosts and vCenter are on the same version, e.g. both on 6.7 GA/pre-U1 or both on 6.7 U1. I say this because there have been a LOT of backend changes in 6.7 U1, and as a result I have seen some loopy behaviour when trying to create/administer 6.7 U1 vSAN clusters with a 6.7 vCenter (that combination is incompatible and unsupported).
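A quick way to compare the host builds from an SSH session (a minimal sketch; run it on each host and compare against the vCenter build shown in the vSphere Client's About dialog):

# Print the ESXi product version and build number
vmware -vl
esxcli system version get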

"If I rebooted host 1 the boot would hang until I removed the SSD and then it booted successfully"

Where did it get stuck? During the module pre-load, or after this (and at what module)?

"2018-12-30T00:03:55.487Z cpu0:2097175)WARNING: NMP: nmpCompleteRetryForPath:357: Retry cmd 0x28 (0x459a40d2c140) to dev "t10.ATA_____V42DCT064V4SSD2__________________________200105984___________" failed on path "vmhba1:C0:T0:L0" H:0x1 D:0x2 P:0x$"

Looks like you may be using 'less' and cutting off the ends of the lines; try using 'cless' or less with line wrapping enabled.
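For example (less wraps long lines by default, while the -S switch chops them at the screen edge):

# View the full warning lines without truncation
less /var/log/vmkwarning.log

# Or pull every line that mentions the suspect SSD
grep V42DCT064V4SSD2 /var/log/vmkernel.log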

Please recreate the issue via the GUI, retry using the CLI (e.g. esxcli vsan storage add -s naa.xxxxxx -d naa.xxxxxxx), and then pull vobd.log and vmkernel.log and attach them here.
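Something along these lines from the host's shell; the device identifiers are placeholders, so substitute the ones reported on your host:

# List local devices and check their vSAN eligibility
esxcli storage core device list
vdq -q

# Create the disk group manually: one cache SSD (-s) plus one or more capacity disks (-d)
esxcli vsan storage add -s <cache_ssd_id> -d <capacity_hdd_id>

# Confirm what was claimed, then collect the logs to attach
esxcli vsan storage list
cp /var/log/vobd.log /var/log/vmkernel.log /vmfs/volumes/<datastore>/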


You could also try switching the devices into one of the other hosts and testing whether the problem follows them (which would indicate a hardware issue).
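To confirm which physical disk you have moved and whether it is reporting errors, something like this may help (the device name is a placeholder for the full t10. identifier from your logs):

# Model, serial and status of the suspect SSD
esxcli storage core device list -d <device_id>

# SMART counters, if the controller exposes them
esxcli storage core device smart get -d <device_id>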

Bob

Grim192
Enthusiast

I've fixed it!

Host 1 was on an older version of ESXi; this wasn't the initial cause of the issue. I tried upgrading the hosts when it originally failed, and I must have forgotten to upgrade host 1 when I moved the local storage to another host. I still don't actually know what caused the vSAN disk failure with the SSD on host 1.
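For anyone who hits the same thing, an easy way to spot a host left on an older build is to compare the installed image profile on each host (a quick sketch; the profile names will differ in your environment):

# Shows the name and version of the image profile the host was installed/upgraded with
esxcli software profile get

# And the raw version/build, for comparing against the other hosts
esxcli system version get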
