Engineer5
Contributor

After power loss (iSCSI device): iSCSI mount impossible - partition table shows a different layout

Situation: I have two identical ESXi 6.0 hosts (same build 13635687) on fairly identical hardware. Both mount two different iSCSI devices (an old Thecus and a new QNAP).

The QNAP went through a power cycle. One of my ESXi hosts lost the connection and the iSCSI datastore; the other recovered the connection. I rebooted the affected ESXi host: the iSCSI mount to the Thecus came back, but the QNAP mount was still missing. I removed the dynamic iSCSI discovery target (x.y.z.a:3160), re-entered the CHAP credentials, and the device shows up again, but no iSCSI partition is found.
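For reference, the equivalent esxcli commands look roughly like this - vmhba## is a placeholder for the software iSCSI adapter, and I did most of this through the host client rather than the shell:

esxcli iscsi adapter discovery sendtarget remove -A vmhba## -a x.y.z.a:3160
esxcli iscsi adapter discovery sendtarget add -A vmhba## -a x.y.z.a:3160
esxcli storage core adapter rescan -A vmhba##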

The working ESXi tells me this partition table:

[root@localhost:~] esxcli storage core device partition list
Device                                Partition  Start Sector  End Sector  Type           Size
------------------------------------  ---------  ------------  ----------  ----  -------------
naa.6e843b620dac99cdd168d477eda0fcd7          0             0  7193231360     0  3682934456320
naa.6e843b620dac99cdd168d477eda0fcd7          1          2048  7193231327    fb  3682933390848

 

The non-working ESXi tells me this layout:

[root@ESXi1:~] esxcli storage core device partition list
Device                                Partition  Start Sector  End Sector  Type           Size
------------------------------------  ---------  ------------  ----------  ----  -------------
naa.644a84203ae4650024b3278c129c78d4          0             0   467664896     0   239444426752
naa.644a84203ae4650024b3278c129c78d4          1          2048  1403387904    83   718533558272

So technically, the two hosts do not see the same partitions at all.

And the non-mounting ESXi tells me this (which is consistent with the table it sees):

[root@ESXi1:~] partedUtil getptbl /dev/disks/naa.644a84203ae4650024b3278c129c78d4
Error: Can't have a partition outside the disk!
Unable to read partition table for device /dev/disks/naa.644a84203ae4650024b3278c129c78d4
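That error fits the table above: the end sector of partition 1 (1403387904) lies well beyond the last sector of the device (467664896). The usable sector range the host actually sees can also be checked with partedUtil, e.g.:

[root@ESXi1:~] partedUtil getUsableSectors /dev/disks/naa.644a84203ae4650024b3278c129c78d4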

Any help is appreciated.

(Note: I did not reboot the working ESXi host; I am really too scared it loses the connection too.)

 

 

continuum
Immortal

First of all, unmount the iSCSI volume on the host that sees the wrong table.

Then try 

vmkfstools -V

and report results.
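From the ESXi shell that would look roughly like this - the datastore label below is only a placeholder, check the list output for the real one:

esxcli storage filesystem list
esxcli storage filesystem unmount -l QNAP_datastore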

 

Ulli


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

Engineer5
Contributor

I'm at a loss as to how to unmount the iSCSI volume. I removed the sendtarget, but the device still shows up:

[root@ESXi1:~] esxcli storage core device partition list
Device                                Partition  Start Sector  End Sector  Type           Size
------------------------------------  ---------  ------------  ----------  ----  -------------
naa.644a84203ae4650024b3278c129c78d4          0             0   467664896     0   239444426752
naa.644a84203ae4650024b3278c129c78d4          1          2048  1403387904    83   718533558272

vmkfstools -V does not output anything at all - no error, nothing.
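The only other thing I can think of is detaching the device itself rather than a datastore - just a guess on my part, not tried yet:

[root@ESXi1:~] esxcli storage core device set -d naa.644a84203ae4650024b3278c129c78d4 --state=off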

 

continuum
Immortal

It is too risky to continue this without seeing the complete picture.

We can't mess with something that is detected as a Linux filesystem, as it may do serious harm on the QNAP.

If necessary, let's look at it via TeamViewer.

Engineer5
Contributor

*** Big KUDOS to continuum (Ulli) for offering to help and actually looking into the problem right in the middle of the night. While we were unable to fix it on the ESXi side, we were able to pinpoint the network as the culprit. ***

To follow up on this issue: the root cause was a network problem between the (multihomed) QNAP and the ESXi server, which also had a bonded gigabit connection. Specifically, the link-aggregation connection between the 10-gigabit switch stack and the 1-gigabit switch stack (both Cisco) got flaky. In fact, a number of additional connectivity issues arose while investigating (a WLAN AP could not talk to the firewall, etc.), and the ESXi iSCSI issue was only the first symptom of the underlying problem. We were unable to fully reset the network switches, since other production systems were still up.

Since the network was acting up, we kept looking for a workaround. I then swapped the network interface of the failing ESXi server for a module with 10-gigabit ports and connected it to the 10-gigabit side of the switch network. After that, the ESXi host was able to talk to the QNAP iSCSI target again, and we could re-register all VMs.
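(For anyone finding this later: re-registering was just the usual vim-cmd call per VM; the datastore and VM names below are placeholders.)

[root@ESXi1:~] vim-cmd solo/registervm /vmfs/volumes/QNAP_datastore/myvm/myvm.vmx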

After this network move, it took another 30 minutes for any ARP cache issues to time out, and then the network came out of its flaky mode as well.

 
