marauli
Enthusiast

Troubleshooting iSCSI

Our Meraki network switches had their firmware updated, and all of a sudden 3 out of 4 ESXi hosts lost connectivity to an iSCSI target.

(The target is a Dell R730 server running Ubuntu 22.04.3 LTS with the sole purpose of serving as iSCSI storage for a VMware cluster. The ESXi hosts are on 7.0 U3.)

When scanning the iSCSI storage adapter on an ESXi host that can no longer mount the datastore, the host appears to recognize the LUN: it adds an entry under "static targets" - presumably based on scanning the dynamic ones:

marauli_0-1691435165315.png

(The device / target / LUN is highlighted in green.)

Yet I don't see it presented as a "device" (which I could mount as a datastore, or on which I could create one) under "devices":

marauli_1-1691435378089.png

For comparison, here is one of the ESXis that can see the device:

marauli_2-1691435556940.png

It shows as "degraded" (probably because of lack of NIC redundancy - where would I look to confirm?) - yet it does show up, and I can seemingly create a datastore on that target.

I also spun up an older standalone ESXi 6.7 and it can also see the device. My Windows desktop - ditto.

How would I troubleshoot this issue on the ESXi hosts that can't seem to recognize the iSCSI target as a valid device?

Thanks!

P.S. (Edit) '/var/log/vobd.log' has a number of these pointing to a network configuration issue:

 

2023-08-04T23:31:20.246Z: [iscsiCorrelator] 624003451802us: [vob.iscsi.target.connect.error] vmhba64 @ vmk1 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-04T23:31:20.246Z: [iscsiCorrelator] 624000365087us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk1 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T16:25:21.187Z: [iscsiCorrelator] 857645568069us: [vob.iscsi.discovery.connect.error] discovery failure on vmhba64 to r730b-00.datastores.infra.<masked>.com because of a network connection failure.
2023-08-07T16:25:21.187Z: [iscsiCorrelator] 857641306178us: [esx.problem.storage.iscsi.discovery.connect.error] iSCSI discovery to r730b-00.datastores.infra.<masked>.com on vmhba64 failed. The iSCSI Initiator could not establish a network connection to the discovery address.

 

What command could I run on the affected ESXi hosts to confirm lack of necessary connectivity to the target?

CarltonR
Hot Shot

I would suggest first validating the [Cisco] Meraki switches' routing tables, etc. Also check the ESXi vmkernel.log and syslog.log.
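On the ESXi shell, something along these lines should pull out the relevant entries (log paths per the standard ESXi 7.x layout):

```shell
# Recent iSCSI-related messages in the vmkernel log
grep -i iscsi /var/log/vmkernel.log | tail -50

# Login/discovery errors recorded by the VOB daemon
grep -i iscsiCorrelator /var/log/vobd.log | tail -20
```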

 

As to troubleshooting iSCSI, then these might be a good place to start:


Troubleshooting ESXi connectivity to iSCSI arrays using software initiators (1003952) 16/03/2021
https://kb.vmware.com/s/article/1003952

Troubleshooting ESX and ESXi connectivity to iSCSI arrays using hardware initiators (1003951) 23/09/2013
https://kb.vmware.com/s/article/1003951
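The first connectivity checks from those articles boil down to roughly the following, run from the ESXi shell (the target IP and vmkernel interface below are placeholders - substitute your own):

```shell
# Ping the target from the vmkernel interface bound to software iSCSI
# (-I selects the vmk interface to source the ping from)
vmkping -I vmk0 <Target_IP>

# Verify the iSCSI port on the target is reachable over TCP
nc -z <Target_IP> 3260
```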

 

Also, I'm sure that you've already been through this guide:

Best Practices for Configuring Networking with Software iSCSI
https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-vcenter-server-703-storage-guide.pdf

trobertson
Enthusiast

I tried to use Meraki MS425s as my ToR solution for a VxRail cluster but I continuously received "High pNic error rate detected." Switched over to Dell S5212Fs for my cluster and the problem is gone.

I understand this was working but stopped after a FW upgrade, which is a total P.I.T.A., and given the lack of visibility in the Meraki platform, troubleshooting is reduced to an almost trial-and-error approach. I just forewarn you that you may face additional challenges in the future, and I wish you good luck. I'm eager to see if you can get this resolved.

marauli
Enthusiast

Thanks @trobertson! The switches in question are MS225-48FP (the ESXis and iSCSI devices are connecting to the 10Gb ports on them) and I don't believe it's an option to try a different switch.

Perhaps what I am looking for is something like `telnet <iSCSI target's IP> 3260` (or rather `nc`, since ESXi ships that instead of telnet) from the affected ESXi host. If this gets dropped or times out - while the behavior on an unaffected ESXi is different - that would confirm the issue is likely with the switches (or, less likely, with the ESXi network configuration or the iSCSI target's firewall). I.e., basic troubleshooting steps from someone more knowledgeable about ESXi.

marauli
Enthusiast

Thank you - some of the steps in those articles may be exactly what I was looking for - I'll try them.

Check network connectivity: vmkping -I <SW iSCSI vmkernel> <Target_IP>
Check SW iSCSI port: nc -z <Target_IP> 3260
marauli
Enthusiast

Pinging and port connectivity - no issues, e.g.:

[***@*****ESXi01:~] nc -z <iSCSI tgt IP> 3260
Connection to <iSCSI tgt IP> 3260 port [tcp/*] succeeded!
stadi13
Hot Shot

Do you use Jumbo Frames on the iSCSI interface? If so please check if the port settings are the same as configured on the vmkernel.
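On the ESXi side, the configured MTUs can be verified like this (vmk0 is just an example interface; the 8972-byte test is only meaningful if MTU 9000 is configured along the whole path):

```shell
# MTU configured on each vmkernel interface
esxcli network ip interface list

# MTU configured on the standard vSwitches
esxcli network vswitch standard list

# End-to-end jumbo-frame test: maximum payload with "don't fragment" set
vmkping -I vmk0 -d -s 8972 <Target_IP>
```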

marauli
Enthusiast

From what I can tell, no jumbo frames. MTU size is the default 1500.

CarltonR
Hot Shot

You outlined above that everything worked before the Meraki switches had their firmware updated, and that after the update only one of the ESXi servers was able to connect to the iSCSI target (Dell R730 server, Ubuntu 22.04.3 LTS). I would suggest that you focus on the switch(es).

Questions:

i. Could you confirm that it was only the switches' firmware that was updated, and that no other part of the infrastructure or configuration was changed (i.e. that no switches were swapped out, and that switch config, routing, cabling, and any config on the ESXi hosts or the iSCSI target were not changed in any way whatsoever)?

ii. How many switches are involved? If more than one, which ESXi and iSCSI target is connected to which switch?

iii. Is there only one iSCSI target for all of the four ESXi servers ?

iv. Are all of the switches at the same firmware version ?

v. Are all the switches the same make/model/version ?

vi. Have you reviewed the switches firmware version update notes to determine what was changed ?

vii. Are you using VLANs ?

viii. You mentioned that you spun up an 'older standalone ESXi 6.7' and a 'Windows Desktop', both of which could "also see the device". Would I be correct to assume that 'see the device' means they could see the iSCSI target in question and mount the storage?

ix. How many network connections does each of the ESXi servers have?

x. How many network connections does the iSCSI target have ?

Suggest:

- I would suggest moving one of the ESXis that cannot connect to the iSCSI target to one of the switch ports that you know is working (i.e. a port used by the ESXi that still works after the firmware upgrade, by the ESXi 6.7, or by the Windows desktop, assuming 'viii' is correct).

- As an experiment, you could exclude the switches from the equation altogether and make an appropriate direct connection, or alternatively connect via another type/make of switch.

 

 

marauli
Enthusiast

Sorry it's taking me so long to respond! (Was fighting other fires.)


i. Could you confirm that it was only the switches' firmware that was updated, and that no other part of the infrastructure or configuration was changed (i.e. that no switches were swapped out, and that switch config, routing, cabling, and any config on the ESXi hosts or the iSCSI target were not changed in any way whatsoever)?

To my knowledge, just the firmware - although can't be 100% sure. The network infra is handled by someone else, and I have limited access to it. On VMware side - I do have full access, and do not see any changes made to ESXis or to the iSCSI system.

(One possibly relevant bit of info: the affected ESXis (3 out of 4) pre-date me joining the team, i.e. they were originally configured by someone else. The last one, unaffected by the change, was added to the cluster and configured by yours truly. I pored over the network config pages on all ESXis trying to zero in on what could be different between the affected and unaffected hosts - can't find anything. Did the same in Meraki - ditto.)


ii. How many switches are involved? If more than one, which ESXi and iSCSI target is connected to which switch?

Around 4-5: each switch has four 10Gb ports, each connected system uses 2 of them, and between the 4 ESXis and 1 iSCSI target, ten 10Gb ports are used across a number of switches.


iii. Is there only one iSCSI target for all of the four ESXi servers ?

Two for the 1st three ESXis, one for the last one. The 2nd one is a Dell/EMC ME4024 flash array direct-attached (via direct 10Gb links, no switches involved) to the 1st three ESXis.


iv. Are all of the switches at the same firmware version ?

v. Are all the switches the same make/model/version ?

vi. Have you reviewed the switches firmware version update notes to determine what was changed ?


Yes and yes; I see no relevant notes.


vii. Are you using VLANs ?


We do use VLANs, and the ports on the switches are configured the same way across all ESXis and the iSCSI target - at least while we're troubleshooting the issue:

 

Type           Trunk
Native VLAN    <masked>
Allowed VLANs  all
Access policy  Open

 

 


viii. You mentioned that you spun up an 'older standalone ESXi 6.7' and a 'Windows Desktop', both of which could "also see the device". Would I be correct to assume that 'see the device' means they could see the iSCSI target in question and mount the storage?


Correct.


ix. How many network connections does each of the ESXi servers have?

x. How many network connections does the iSCSI target have ?


The first three (affected) ESXis have six each:

  • two 10Gb NICs for general traffic, vMotion, and switched iSCSI
  • two 1Gb legacy NICs: still connected but no longer active (no port groups, vSwitches, or VMkernel adapters are attached to them)
  • two 10Gb NICs for direct-attached iSCSI (the ME4024 mentioned above)

The 4th (unaffected):

  • two 10Gb NICs for general traffic, vMotion, switched iSCSI
  • (it's not connected to ME4024)

iSCSI target: two 10Gb NICs, only one of which is active now. The 2nd is connected to a switch port that our network admin disabled because, for some reason, Meraki raised IP conflict alarms on the two ports for the target despite no apparent conflict (the IPs are different).


- I would suggest moving one of the ESXis that cannot connect to the iSCSI target to one of the switch ports that you know is working (i.e. a port used by the ESXi that still works after the firmware upgrade, by the ESXi 6.7, or by the Windows desktop, assuming 'viii' is correct).

- As an experiment, you could exclude the switches from the equation altogether and make an appropriate direct connection, or alternatively connect via another type/make of switch.


Thank you! I'll check with the network admin on both options.

marauli
Enthusiast

Could this be relevant? All iSCSI connection failures occur on vmk1 and vmk2 (which have no network connection to the target), while there is nothing logged for vmk0 - the vmkernel NIC that does have the network path to the iSCSI target in question.

 

2023-08-07T22:49:25.324Z: [iscsiCorrelator] 569241542us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk2 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T22:49:25.325Z: [iscsiCorrelator] 569233690us: [vob.iscsi.target.connect.error] vmhba64 @ vmk2 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-07T22:49:25.326Z: [iscsiCorrelator] 569242709us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk2 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T22:49:25.326Z: [iscsiCorrelator] 569234520us: [vob.iscsi.target.connect.error] vmhba64 @ vmk1 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-07T22:49:25.326Z: [iscsiCorrelator] 569243520us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk1 failed. The iSCSI initiator could not establish a network connection to the target.
2023-08-07T22:49:25.327Z: [iscsiCorrelator] 569235334us: [vob.iscsi.target.connect.error] vmhba64 @ vmk1 failed to login to iqn.1988-11.com.dell:01.array.bc305bf24e32 because of a network connection failure.
2023-08-07T22:49:25.327Z: [iscsiCorrelator] 569244253us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.1988-11.com.dell:01.array.bc305bf24e32 on vmhba64 @ vmk1 failed. The iSCSI initiator could not establish a network connection to the target.

 

vmk1 and vmk2 are VMkernel NICs dedicated to the direct-attached iSCSI connections. They have no path to the iSCSI target in question, which is on a switched network.

marauli_0-1693256987160.png

I guess I am puzzled why the ESXi doesn't attempt to connect to the target using vmk0, and what I can do to force it to.

 

kastlr
Expert

Hi,

your description is a bit confusing.

In your initial post you refer to vmk1 & vmk2, and you also added snippets from the configuration.

In your last post you mention that vmk0 is the one to focus on, and that vmk1 & vmk2 aren't the interfaces which should handle the iSCSI traffic/connection.

So which statement is correct?


Hope this helps a bit.
Greetings from Germany. (CEST)
marauli
Enthusiast

Sorry about that - it's all very confusing to me too.

vmk1 and 2 should not handle connections to the iSCSI target in question (located on a switched network). Only vmk0 can handle those as that's the only adapter with a path to it.

(vmk1 and 2 are adapters for direct-attached connections only, i.e. for targets not on a switched network)

If that still doesn't clear it up - let me know.

kastlr
Expert

Hi,

 

so if we are to assist you in reestablishing an iSCSI connection between vmk0 and your iSCSI target, we need some information about your environment.

  • Are you able to ping the iSCSI target(s) IP addresses using vmkping -I vmk0 -d -s 1450 <iSCSI target IP address>?
  • What do the NIC stats look like, using esxcli network nic stats get -n vmnicX?
  • Perform some of the tests described earlier and post the results.

Hope this helps a bit.
Greetings from Germany. (CEST)
marauli
Enthusiast


Are you able to ping the iSCSI target(s) IP addresses using vmkping -I vmk0 -d -s 1450 <iSCSI target IP address>?
[root@***-ESXi-01:~] vmkping -I vmk0 -d -s 1450 <IP>
PING <IP> (<IP>): 1450 data bytes
1458 bytes from <IP>: icmp_seq=0 ttl=64 time=0.224 ms
1458 bytes from <IP>: icmp_seq=1 ttl=64 time=0.277 ms
1458 bytes from <IP>: icmp_seq=2 ttl=64 time=0.278 ms

--- <IP> ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.224/0.260/0.278 ms

  • What do the NIC stats look like, using esxcli network nic stats get -n vmnicX?

NIC statistics for vmnic0
   Packets received: 1696266074
   Packets sent: 2119143975
   Bytes received: 645191082460
   Bytes sent: 2243595138013
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 108342319
   Broadcast packets received: 30483370
   Multicast packets sent: 1932358
   Broadcast packets sent: 1573782
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0

NIC statistics for vmnic1
   Packets received: 471521582
   Packets sent: 212317143
   Bytes received: 314177704382
   Bytes sent: 80328464786
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 110083599
   Broadcast packets received: 32038626
   Multicast packets sent: 257459
   Broadcast packets sent: 19461
(all "error" values - 0)

NIC statistics for vmnic2
   Packets received: 140071964
   Packets sent: 0
   Bytes received: 13097333291
   Bytes sent: 0
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 107577113
   Broadcast packets received: 32255292
   Multicast packets sent: 0
   Broadcast packets sent: 0
(all "error" values - 0)

NIC statistics for vmnic3
   Packets received: 140073820
   Packets sent: 0
   Bytes received: 13097515909
   Bytes sent: 0
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 107578538
   Broadcast packets received: 32255717
   Multicast packets sent: 0
   Broadcast packets sent: 0
(all "error" values - 0)

NIC statistics for vmnic4
   Packets received: 12515272569
   Packets sent: 2463924551
   Bytes received: 17668531459730
   Bytes sent: 2695041739606
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 0
   Broadcast packets received: 2
   Multicast packets sent: 0
   Broadcast packets sent: 1699
(all "error" values - 0)

NIC statistics for vmnic5
   Packets received: 140778
   Packets sent: 266174
   Bytes received: 16919128
   Bytes sent: 24241914
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 0
   Broadcast packets received: 2
   Multicast packets sent: 0
   Broadcast packets sent: 1695
(all "error" values - 0)

 

a_p_
Leadership

What I'm worried about is the iSCSI Port Binding configuration.
Port Binding should only be used in single-subnet iSCSI configurations, i.e. where all initiators (vmkX) are able to reach all targets. This however is not the setup that you are using, is it?
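You can check (and, if needed, change) the binding from the ESXi shell - the adapter name vmhba64 is taken from your logs, and vmk1 is just an example:

```shell
# Show which vmkernel NICs are bound to the software iSCSI adapter
esxcli iscsi networkportal list --adapter=vmhba64

# Remove a bound vmkernel NIC that has no path to the target
esxcli iscsi networkportal remove --adapter=vmhba64 --nic=vmk1

# Rescan the adapter afterwards
esxcli storage core adapter rescan --adapter=vmhba64
```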

Is this configured the same as on the other, working ESXi hosts?

André

marauli
Enthusiast

Agreed - that was my bad; I enabled it before reading up on it (though after this issue occurred) - yet it doesn't appear to affect anything.

First three hosts have the same configuration, the fourth one doesn't (as it's not connected to the direct-attached ME4024 iSCSI target, unlike the first three).

As soon as I can put one of them in maintenance mode, I'll remove the binding and see if that does anything - but I'm not expecting it to.

(Putting it in maintenance is tricky as our cluster is running low on memory - so I am trying to tread gently.)

marauli
Enthusiast


yet it doesn't appear to affect anything.

I was wrong on that one.

Put one of the affected hosts in maintenance mode, removed the binding, re-scanned the adapter - and while the ESXi still doesn't see the iSCSI target in question, it now does see a test one I created just yesterday on an Ubuntu VM.

marauli_0-1693422915660.png

(In green are the paths to the new test iSCSI target; in red, the paths to the iSCSI target the hosts can no longer see.)

marauli_1-1693423091250.png

Another new development: the 4th ESXi, which used to see the iSCSI target in question, can no longer see it. (The only change is that the server running that iSCSI storage - Ubuntu Linux on a Dell R730 - was updated via 'apt-get upgrade' and rebooted today.)

With all this, could the culprit be the Ubuntu Linux iSCSI service (tgt) configuration?
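If it is tgt, I suppose I can inspect the live target definitions and ACLs on the Ubuntu box with tgt's own tooling (config paths per the Ubuntu package defaults):

```shell
# Show all configured targets: their LUNs, bound portals, and initiator ACLs
sudo tgtadm --lld iscsi --mode target --op show

# Confirm tgt is actually listening on the iSCSI port
sudo ss -tlnp | grep 3260

# Review the persistent target configuration
cat /etc/tgt/targets.conf /etc/tgt/conf.d/*.conf
```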

kastlr
Expert

Hi,

based on the latest info you provided, it looks like:

  • IP connections between the host(s) and iSCSI targets are possible, regardless of whether they are direct-attached or go via the switched environment
  • judging from the snippet, that node also seems to see at least 2 iSCSI LUNs on the Dell R730 server
  • as you already rebooted the iSCSI target, any kind of stale LUN reservation should be cleared

So, in theory, everything looks fine - but did you ever perform a rescan operation from your ESXi hosts (or cluster-wide) to verify whether the target (and its LUNs) are still visible after the rescan?

It might be that we are simply seeing some stale entries, so running a rescan operation would be a valid troubleshooting step.

Depending on the result (LUNs are or aren't visible after the rescan) we should have a better understanding and should be able to offer additional steps.

 


Hope this helps a bit.
Greetings from Germany. (CEST)
marauli
Enthusiast


did you ever perform a rescan operation from your ESXi hosts (or cluster-wide) to verify whether the target (and its LUNs) are still visible after the rescan?

At least a few dozen times - I rescan any time I make any change to the network or storage configuration.

I have a feeling something on the iSCSI target (r730b) might be misconfigured, given it disappeared even from ESXis where it was present before, while my test iSCSI target (on the same Ubuntu version) seems to be working OK and is visible from multiple ESXis. I'll have to dive into the Ubuntu networking and iSCSI (tgt) configuration.
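As a first step, a SendTargets discovery from my test Ubuntu VM (using open-iscsi) should at least show whether r730b still advertises the target at all:

```shell
# Discover targets advertised by the portal (open-iscsi must be installed)
sudo iscsiadm -m discovery -t sendtargets -p <r730b IP>:3260
```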
