Situation: I have two identical ESXi 6.0 hosts (same build 13635687) on nearly identical hardware. Both mount two different iSCSI devices (an old Thecus, a new QNAP).
The QNAP was power-cycled. One of my ESXi hosts lost the connection and lost the iSCSI drive. The other recovered the connection. I rebooted the failed ESXi host; the iSCSI mount to the Thecus came back up, but the QNAP mount was still missing. I removed the dynamic iSCSI target (x.y.z.a:3160) and re-entered the CHAP credentials, and the device now shows up, but no iSCSI partition is found.
The working ESXi tells me this partition table:
[root@localhost:~] esxcli storage core device partition list
Device Partition Start Sector End Sector Type Size
------------------------------------ --------- ------------ ---------- ---- -------------
naa.6e843b620dac99cdd168d477eda0fcd7 0 0 7193231360 0 3682934456320
naa.6e843b620dac99cdd168d477eda0fcd7 1 2048 7193231327 fb 3682933390848
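As a side note, the sizes in this table are internally consistent: the reported Size equals (End Sector − Start Sector) × 512 bytes. A quick sanity check of that arithmetic (a sketch, assuming 512-byte logical sectors as ESXi reports them):

```python
SECTOR = 512  # assumption: 512-byte logical sectors

# (start, end, reported_size) rows from the working host's output above
rows = [
    (0,    7193231360, 3682934456320),  # partition 0: the whole device
    (2048, 7193231327, 3682933390848),  # partition 1: type fb (VMFS)
]

for start, end, size in rows:
    # Size column = (End Sector - Start Sector) * 512 bytes
    assert (end - start) * SECTOR == size

print("table is internally consistent")  # prints only if all rows check out
```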
The non-working ESXi tells me this layout:
[root@ESXi1:~] esxcli storage core device partition list
Device Partition Start Sector End Sector Type Size
------------------------------------ --------- ------------ ---------- ---- -------------
naa.644a84203ae4650024b3278c129c78d4 0 0 467664896 0 239444426752
naa.644a84203ae4650024b3278c129c78d4 1 2048 1403387904 83 718533558272
So the two hosts clearly do not see the same partition table...
and the non-mounting ESXi tells me (and is right, given its view of the table):
[root@ESXi1:~] partedUtil getptbl /dev/disks/naa.644a84203ae4650024b3278c129c78d4
Error: Can't have a partition outside the disk!
Unable to read partition table for device /dev/disks/naa.644a84203ae4650024b3278c129c78d4
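The error message is literal: in the broken host's table above, partition 1 ends at sector 1403387904, while the device itself only spans 467664896 sectors, so the partition really does extend past the end of the disk. A minimal sketch of the bounds check partedUtil is effectively performing (a hypothetical re-implementation for illustration, not VMware's code):

```python
def partition_fits(disk_end_sector, part_start, part_end):
    """Return True if the partition lies entirely within the disk."""
    return 0 <= part_start <= part_end <= disk_end_sector

# numbers from the non-working host's output above
disk_end = 467664896             # end sector of partition 0 (the whole device)
p1_start, p1_end = 2048, 1403387904

print(partition_fits(disk_end, p1_start, p1_end))  # prints False
```

On the working host the same check passes (2048..7193231327 fits inside 7193231360 sectors), which matches the two hosts seeing different tables for what should be the same LUN.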
Any help is appreciated.
(Note: I did not reboot the working ESXi host; I am too scared it loses the connection as well.)
First of all, unmount the iSCSI volume on the host that sees the wrong table.
Then try
vmkfstools -V
and report the results.
Ulli
I'm at a loss how to unmount the iSCSI volume. I removed the send target, but the device still shows up:
[root@ESXi1:~] esxcli storage core device partition list
Device Partition Start Sector End Sector Type Size
------------------------------------ --------- ------------ ---------- ---- -------------
naa.644a84203ae4650024b3278c129c78d4 0 0 467664896 0 239444426752
naa.644a84203ae4650024b3278c129c78d4 1 2048 1403387904 83 718533558272
vmkfstools -V does not output anything: no error, nothing.
It is too risky to continue without seeing the complete picture.
We can't mess with something that is detected as a Linux filesystem (partition type 83), as that may do serious harm on the QNAP side.
If necessary, let's look via TeamViewer...
*** Big KUDOS to continuum (Ulli) for offering and actually looking into the problem right in the middle of the night. While we were unable to fix it on the ESXi side, we were able to pinpoint the network as culprit. ***
To follow up on this issue: the root cause was a network problem between the (multihomed) QNAP and the ESXi server, which also had a bonded gigabit connection. The (link aggregation) connection between the 10-gigabit switch stack and the 1-gigabit switch stack (both Cisco) had become flaky. In fact, a number of additional connectivity issues surfaced during the investigation (a WLAN AP could not talk to the firewall, etc.), and the ESXi iSCSI failure was simply the first symptom of the underlying problem. We were unable to fully reset the network switches, since other production systems were still up.
Since the network was acting up, we kept looking for a workaround. I then swapped the failing ESXi server's network interface for a module with 10-gigabit ports and connected it to the 10-gigabit side of the switch network. After that, the ESXi host could talk to the QNAP iSCSI target again, and we could re-register all the VMs.
After this network move, it took another 30 minutes for the ARP caches to time out, and then the network was out of its flaky mode as well.