Our Meraki network switches had their firmware updated, and all of a sudden 3 out of 4 ESXis lost connectivity to an iSCSI target.
(The target is a Dell R730 server running Ubuntu 22.04.3 LTS whose sole purpose is to serve as iSCSI storage for a VMware cluster. The ESXi hosts are on 7.0 U3.)
When scanning the iSCSI storage adapter on an ESXi host that can no longer mount the datastore, the host appears to recognize the LUN: it adds an entry under "Static Targets" - presumably discovered, I am assuming, by scanning the dynamic ones:
(The device / target / LUN is highlighted in green.)
Yet I don't see it presented as a "device" (which I could mount as a datastore, or on which I could create one) under "Devices":
For comparison, here is one of the ESXis that can see the device:
It shows as "degraded" (probably because of the lack of NIC redundancy - where would I look to confirm that?) - yet it does show up, and I can seemingly create a datastore on that target.
I also spun up an older standalone ESXi 6.7 and it can also see the device. My Windows desktop - ditto.
How would I troubleshoot this issue on the ESXi hosts that can't seem to recognize the iSCSI target as a valid device?
Thanks!
P.S. (Edit) '/var/log/vobd.log' has a number of these pointing to a network configuration issue:
2023-08-04T23:31:20.246Z: [iscsiCorrelator] 624003451802us: [vob.iscsi.target.connect.error] vmhba64 @ vmk1 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-04T23:31:20.246Z: [iscsiCorrelator] 624000365087us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk1 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T16:25:21.187Z: [iscsiCorrelator] 857645568069us: [vob.iscsi.discovery.connect.error] discovery failure on vmhba64 to r730b-00.datastores.infra.<masked>.com because of a network connection failure.
2023-08-07T16:25:21.187Z: [iscsiCorrelator] 857641306178us: [esx.problem.storage.iscsi.discovery.connect.error] iSCSI discovery to r730b-00.datastores.infra.<masked>.com on vmhba64 failed. The iSCSI Initiator could not establish a network connection to the discovery address.
What command could I run on the affected ESXi hosts to confirm lack of necessary connectivity to the target?
The culprit wasn't network port binding - it seems to have something to do with ESXis holding on to stale paths and items in Dynamic and Static Discovery.
Here are some tests I did:
To sum up, if an iSCSI datastore disappeared on an ESXi host and isn't showing up no matter how many times you reboot or rescan, try this:
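In esxcli terms, clearing out stale Dynamic and Static Discovery entries looks roughly like this. This is a sketch only: the adapter name (vmhba64) matches the one in the logs above, but the target address is a placeholder, and the IQN shown is the one from my log excerpts - substitute your own values.

```shell
# Sketch only - 192.0.2.10 is a placeholder target address.
# List current discovery entries to spot stale ones:
esxcli iscsi adapter discovery sendtarget list
esxcli iscsi adapter discovery statictarget list

# Remove the stale dynamic (Send Targets) and static entries:
esxcli iscsi adapter discovery sendtarget remove -A vmhba64 -a 192.0.2.10:3260
esxcli iscsi adapter discovery statictarget remove -A vmhba64 \
    -a 192.0.2.10:3260 -n iqn.1988-11.com.dell:01.array.bc305bf24e32

# Re-add discovery and rescan the adapter:
esxcli iscsi adapter discovery sendtarget add -A vmhba64 -a 192.0.2.10:3260
esxcli storage core adapter rescan -A vmhba64
```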
This worked for me and I hope this works for someone else in a similar situation.
I would suggest first validating the [Cisco] Meraki switches' routing tables, etc., and also checking the ESXi vmkernel.log and syslog.log.
As for troubleshooting iSCSI, these might be a good place to start:
Troubleshooting ESXi connectivity to iSCSI arrays using software initiators (1003952) 16/03/2021
https://kb.vmware.com/s/article/1003952
Troubleshooting ESX and ESXi connectivity to iSCSI arrays using hardware initiators (1003951) 23/09/2013
https://kb.vmware.com/s/article/1003951
Also, I'm sure that you've already been through this guide:
Best Practices for Configuring Networking with Software iSCSI
https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-vcenter-server-703-storage-guide.pdf
I tried to use Meraki MS425s for my ToR solution to a VxRail cluster but I continuously received "High pNic error rate detected." Switched over to Dell S5212Fs for my cluster and the problem is gone.
I understand this was working and then stopped after a FW upgrade, which is a total P.I.T.A., and due to the lack of visibility in the Meraki platform, troubleshooting comes down to an almost trial-and-error approach. I just forewarn you that you may face additional challenges in the future, and I wish you good luck. I'm eager to see if you can get this resolved.
Thanks @trobertson! The switches in question are MS225-48FP (the ESXis and iSCSI devices are connecting to the 10Gb ports on them) and I don't believe it's an option to try a different switch.
Perhaps what I am looking for is something like `telnet <iSCSI target's IP> 3260` from the affected ESXi host. If this gets dropped or times out - whereas the behavior on an unaffected ESXi is different - that would confirm the issue is likely with the switches (or, less likely, with the ESXi network configuration or the iSCSI target's firewall). I.e., basic troubleshooting steps from someone more knowledgeable about ESXi.
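Since ESXi has no telnet client, the same reachability test can also be scripted; ESXi ships a python binary, and this runs just as well from any other box on the storage network. A minimal sketch - the host value is a placeholder, not an address from this environment:

```python
# Minimal TCP reachability probe for the iSCSI port (3260) - a sketch only;
# the host, port, and timeout values are placeholders, not from this thread.
import socket

def port_reachable(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, host unroutable, etc.
        return False

# Usage: port_reachable("192.0.2.10") - compare the result on an affected
# vs. an unaffected host to localize the problem.
```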
Thank you - some of the steps in those articles may be exactly what I was looking for - I'll try them.
Check network connectivity: vmkping -I <SW iSCSI vmkernel> <Target_IP>
Check SW iSCSI port: nc -z <Target_IP> 3260
Pinging and port connectivity - no issues, e.g.:
[***@*****ESXi01:~] nc -z <iSCSI tgt IP> 3260
Connection to <iSCSI tgt IP> 3260 port [tcp/*] succeeded!
Do you use Jumbo Frames on the iSCSI interface? If so please check if the port settings are the same as configured on the vmkernel.
From what I can tell, no jumbo frames. MTU size is the default 1500.
You outlined above that everything worked before the Meraki switches had their firmware updated, and that after the update only one of the ESXi servers was able to connect to the iSCSI target (Dell R730 server, Ubuntu 22.04.3 LTS). I would suggest that you focus on the switch(es) . . .
Questions:
i. Could you confirm that only the switches' firmware was updated, and that no other part of the infrastructure or configuration was changed (i.e. no switches swapped out, and no changes whatsoever to switch config, routing, cabling, or any config on the ESXi hosts or the iSCSI target) ?
ii. How many switches are involved ? If more than one, which ESXi host and iSCSI target is connected to which switch ?
iii. Is there only one iSCSI target for all of the four ESXi servers ?
iv. Are all of the switches at the same firmware version ?
v. Are all the switches the same make/model/version ?
vi. Have you reviewed the switches firmware version update notes to determine what was changed ?
vii. Are you using VLANs ?
viii. You mentioned that you spun up an 'older standalone ESXi 6.7' and a 'Windows Desktop', both of which could "also see the device". Would I be correct to assume that 'see the device' means they could see the iSCSI target in question and mount the storage ?
ix. How many network connections does each of the ESXi servers have ?
x. How many network connections does the iSCSI target have ?
Suggest:
- I would suggest moving one of the ESXi hosts that cannot connect to the iSCSI target to one of the switch ports that you know is working (i.e. a port used by the ESXi host that still works after the switch firmware upgrade, the ESXi 6.7 host, or the Windows desktop (assuming 'viii' to be correct)).
- As an experiment, you could exclude the switches from the equation altogether and make an appropriate direct connection, or alternatively go via another type/make of switch.
Sorry it's taking me so long to respond! (Was fighting other fires.)
i. Could you confirm that only the switches' firmware was updated, and that no other part of the infrastructure or configuration was changed (i.e. no switches swapped out, and no changes whatsoever to switch config, routing, cabling, or any config on the ESXi hosts or the iSCSI target) ?
To my knowledge, just the firmware - although I can't be 100% sure. The network infra is handled by someone else, and I have limited access to it. On the VMware side, I do have full access, and I do not see any changes made to the ESXis or to the iSCSI system.
(One possible relevant bit of info: the affected ESXis (3 out of 4) pre-date me joining the team, i.e. configured by someone else originally. The last one, that is unaffected by the change - added to the cluster and configured by yours truly. I pored over network config pages on all ESXis trying to zero in on what could be different between the affected and unaffected ESXis - can't find anything. Did the same in Meraki - ditto.)
ii. How many switches are involved ? If more than one, which ESXi host and iSCSI target is connected to which switch ?
Around 4-5: there are four 10Gb ports on each switch, each connected system uses 2 of them, and between 4 ESXis and 1 iSCSI target, ten 10Gb ports are used across a number of switches.
iii. Is there only one iSCSI target for all of the four ESXi servers ?
Two for the first three ESXis, one for the last one. The second target is a Dell/EMC ME4024 flash array direct-attached (via direct 10Gb links, no switches involved) to the first three ESXis.
iv. Are all of the switches at the same firmware version ?
v. Are all the switches the same make/model/version ?
vi. Have you reviewed the switches firmware version update notes to determine what was changed ?
Yes, yes; see no notes.
vii. Are you using VLANs ?
We do use VLANs, and the switch ports are configured the same way across all ESXis and the iSCSI target - at least while we're troubleshooting the issue:
Type Trunk
Native VLAN <masked>
Allowed VLANs all
Access policy Open
viii. You mentioned that you spun up an 'older standalone ESXi 6.7' and a 'Windows Desktop', both of which could "also see the device". Would I be correct to assume that 'see the device' means they could see the iSCSI target in question and mount the storage ?
Correct.
ix. How many network connections does each of the ESXi servers have ?
x. How many network connections does the iSCSI target have ?
The first three (affected) ESXis: six total:
The 4th (unaffected):
iSCSI target: two 10Gb NICs; only one is active now. The second is connected to a switch port that was disabled by our network admin because, for some reason, Meraki raised IP conflict alarms on the two ports for the target despite there being no apparent conflict (the IPs are different).
- I would suggest moving one of the ESXi hosts that cannot connect to the iSCSI target to one of the switch ports that you know is working (i.e. a port used by the ESXi host that still works after the switch firmware upgrade, the ESXi 6.7 host, or the Windows desktop (assuming 'viii' to be correct)).
- As an experiment, you could exclude the switches from the equation altogether and make an appropriate direct connection, or alternatively go via another type/make of switch.
Thank you! I'll check with the network admin on both options.
Could this be relevant? All iSCSI connection failures occur on vmk1 and vmk2 (which have no network connection to the target), and there isn't anything for vmk0 - which is the vmkernel NIC that has the network path to the iSCSI target in question.
2023-08-07T22:49:25.324Z: [iscsiCorrelator] 569241542us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk2 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T22:49:25.325Z: [iscsiCorrelator] 569233690us: [vob.iscsi.target.connect.error] vmhba64 @ vmk2 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-07T22:49:25.326Z: [iscsiCorrelator] 569242709us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk2 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T22:49:25.326Z: [iscsiCorrelator] 569234520us: [vob.iscsi.target.connect.error] vmhba64 @ vmk1 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-07T22:49:25.326Z: [iscsiCorrelator] 569243520us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk1 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T22:49:25.327Z: [iscsiCorrelator] 569235334us: [vob.iscsi.target.connect.error] vmhba64 @ vmk1 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-07T22:49:25.327Z: [iscsiCorrelator] 569244253us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk1 failed. The iSCSI initiator could not establish a network connection to the target.
vmk1 and vmk2 are kernel NICs dedicated to direct-attached iSCSI connections. They do not have a path to the iSCSI target in question that is on a switched network.
I guess I am puzzled as to why the ESXi doesn't attempt to connect to the target using vmk0, and what I can do to force it to.
Hi,
your description is a bit confusing.
In your initial post you refer to vmk1 & vmk2 and also added snippets from the configuration.
In your last post you mention that vmk0 is the one to focus on, and that vmk1 & vmk2 aren't the interfaces which should handle the iSCSI traffic/connection.
So which statement is correct?
Sorry about that - it's all very confusing to me too.
vmk1 and 2 should not handle connections to the iSCSI target in question (located on a switched network). Only vmk0 can handle those as that's the only adapter with a path to it.
(vmk1 and 2 are adapters for direct-attached connections only, i.e. for targets not on a switched network)
If that still doesn't clear it up - let me know.
Hi,
so if we are to assist you with reestablishing an iSCSI connection between vmk0 and your iSCSI target, we would need information about your environment.
- Are you able to ping the iSCSI target(s) IP addresses using vmkping -I vmk0 -d -s 1450 <iSCSI target IP>?
[root@***-ESXi-01:~] vmkping -I vmk0 -d -s 1450 <IP>
PING <IP> (<IP>): 1450 data bytes
1458 bytes from <IP>: icmp_seq=0 ttl=64 time=0.224 ms
1458 bytes from <IP>: icmp_seq=1 ttl=64 time=0.277 ms
1458 bytes from <IP>: icmp_seq=2 ttl=64 time=0.278 ms
--- <IP> ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.224/0.260/0.278 ms
- What do the NIC stats look like, using esxcli network nic stats get -n vmnicX?
NIC statistics for vmnic0
Packets received: 1696266074
Packets sent: 2119143975
Bytes received: 645191082460
Bytes sent: 2243595138013
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 108342319
Broadcast packets received: 30483370
Multicast packets sent: 1932358
Broadcast packets sent: 1573782
Total receive errors: 0
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 0
Transmit aborted errors: 0
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
NIC statistics for vmnic1
Packets received: 471521582
Packets sent: 212317143
Bytes received: 314177704382
Bytes sent: 80328464786
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 110083599
Broadcast packets received: 32038626
Multicast packets sent: 257459
Broadcast packets sent: 19461
(all "error" values - 0)
NIC statistics for vmnic2
Packets received: 140071964
Packets sent: 0
Bytes received: 13097333291
Bytes sent: 0
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 107577113
Broadcast packets received: 32255292
Multicast packets sent: 0
Broadcast packets sent: 0
(all "error" values - 0)
NIC statistics for vmnic3
Packets received: 140073820
Packets sent: 0
Bytes received: 13097515909
Bytes sent: 0
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 107578538
Broadcast packets received: 32255717
Multicast packets sent: 0
Broadcast packets sent: 0
(all "error" values - 0)
NIC statistics for vmnic4
Packets received: 12515272569
Packets sent: 2463924551
Bytes received: 17668531459730
Bytes sent: 2695041739606
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 0
Broadcast packets received: 2
Multicast packets sent: 0
Broadcast packets sent: 1699
(all "error" values - 0)
NIC statistics for vmnic5
Packets received: 140778
Packets sent: 266174
Bytes received: 16919128
Bytes sent: 24241914
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 0
Broadcast packets received: 2
Multicast packets sent: 0
Broadcast packets sent: 1695
(all "error" values - 0)
What I'm worried about is the iSCSI Port Binding configuration.
Port Binding should only be used in single-subnet iSCSI configurations, i.e. where all initiator vmkernel ports (vmkX) are able to reach all targets. That, however, is not the setup that you are using, is it?
Is this configured the same as on the other, working ESXi hosts?
André
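To check whether port binding is in play, and which vmkernel NICs are bound, something like the following from the ESXi shell should show it. A sketch only: vmhba64 matches the adapter in the logs earlier in the thread, but the vmk name is a placeholder for whichever bound interface has no route to the target.

```shell
# Sketch - vmk1 is a placeholder for a bound vmkernel NIC without a route.
# Show which vmkernel ports are bound to the software iSCSI adapter:
esxcli iscsi networkportal list

# If a bound vmk has no path to the target, unbind it (host in maintenance mode):
esxcli iscsi networkportal remove -A vmhba64 -n vmk1

# Rescan afterwards so the adapter re-evaluates its targets and paths:
esxcli storage core adapter rescan -A vmhba64
```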
Agreed, that was my bad - I enabled it before reading up on it (but after this issue occurred) - yet it doesn't appear to affect anything.
First three hosts have the same configuration, the fourth one doesn't (as it's not connected to the direct-attached ME4024 iSCSI target, unlike the first three).
As soon as I can put one of them in maintenance, I'll remove the binding, see if that does anything - but am not expecting it to.
(Putting it in maintenance is tricky as our cluster is running low on memory - so I am trying to tread gently.)
yet it doesn't appear to affect anything.
I was wrong on that one.
Put one of the affected hosts in maintenance, removed the binding, re-scanned the adapter - and while the ESXi still doesn't see the iSCSI target in question, it does now see a test one I just created yesterday on an Ubuntu VM.
(In green are the paths to the new test iSCSI target, in red - to iSCSI target the hosts can no longer see.)
Another new development: the 4th ESXi, which used to see the iSCSI target in question, can no longer see it. (The only change was that the server running that iSCSI storage - Ubuntu Linux on a Dell R730 - was updated via 'apt-get upgrade' and rebooted today.)
With all this, could the culprit be the Ubuntu Linux iSCSI service (tgt) configuration?
Hi,
based on the latest info you provide it looks like
So, in theory everything looks fine, but did you ever perform a Rescan operation from your ESXi hosts (or cluster-wide) to verify whether the target (and its LUNs) are still visible after the Rescan?
It might be that we are simply seeing some stale entries, so running a rescan operation would be a valid troubleshooting step.
Depending on the result (whether the LUNs are or aren't visible after the rescan), we should have a better understanding and be able to offer additional steps.
did you ever perform a Rescan operation from your ESXi hosts (or cluster-wide) to verify whether the target (and its LUNs) are still visible after the Rescan?
At least a few dozen times - I rescan any time I try any change with a network or storage configuration.
I have a feeling something on the iSCSI target (r730b) might be misconfigured, given that it disappeared even from ESXis where it was present before, while my test iSCSI target (on the same Ubuntu version) seems to be working OK and is visible on multiple ESXis. I'll have to dive into Ubuntu networking and iSCSI (tgt) configuration.
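As a starting point on the Ubuntu side, something like the following would show whether tgt came back healthy after the reboot and whether any ACLs are restricting initiators. A sketch only - it assumes the stock Ubuntu tgt package with config under /etc/tgt/conf.d/:

```shell
# Sketch - check the tgt daemon and its exported targets on the Ubuntu box.
systemctl status tgt                             # did the daemon start after the reboot?
sudo tgtadm --lld iscsi --mode target --op show  # list targets, LUNs, and bound ACLs
sudo tgt-admin --show                            # same info via the config-level tool
ss -tlnp | grep 3260                             # is tgtd listening on the iSCSI port?
grep -r "initiator" /etc/tgt/conf.d/             # any initiator-address/name restrictions?
```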
