VMware Cloud Community
culbeda
Contributor

ESX iSCSI Clustering issues

I have a lab that I'm using for some VMware testing with the following:

2 x HP DL360 G3's running ESX 3.5 with latest patches (upgraded from 3.02)

1 x Windows 2003 SP2 system running VirtualCenter with the two ESX servers configured in a cluster w/ HA and DRS

1 x Thecus N5200 Pro NAS with iSCSI (N5200BR Pro to be exact) with a single iSCSI LUN

1x OpenFiler 2.2.r1166-1-1 system with a single iSCSI LUN

I have configured the network, storage adapters, and firewall, and I'm able to attach to each of the iSCSI LUNs from either one of the machines. I have a VMFS volume created on each and they're working properly.

The problem appears the moment I attach the second machine to the iSCSI targets: I immediately lose the connection on the other server. For example, server "vmlab1" mounts both iSCSI LUNs (1 from Thecus and 1 from OpenFiler), mounts VMFS, and the volumes are listed and accessible. I can move files between local storage and the VMFS. As soon as I start "vmlab2" or rescan for iSCSI targets on vmlab2, I lose my connections to both iSCSI LUNs on vmlab1. The exact same is true in reverse as well.

I can watch these all day long: one host will be properly connected and the other will show red when I do an ls of /vmfs/volumes. One of them will always be connected and the other will not. I've tried a variety of settings on the remote devices as well as in vmkiscsi.conf on the ESX servers, and I've made the configurations match as closely as possible. I've tried disabling HeaderDigests and DataDigests. I've tried matching up the data segment lengths and InitialR2T settings, and making sure that I had a sufficiently large MaxConnections specified on the targets. Oh, and I've set Continuous to "No".
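
One way to confirm that the settings really do match on both hosts is to diff copies of the two config files. Below is a minimal Python sketch, assuming the files have been copied off the hosts and that the settings of interest are flat Key=Value lines (if a key repeats, for example once per target, only the last value is kept, so this is a rough check only). The function names and the sample keys and values are made up for illustration.

```python
def parse_settings(text):
    """Parse 'Key=Value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()  # a repeated key overwrites the earlier value
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Illustrative input only; real values would come from each host's copy of the file.
host1 = parse_settings("HeaderDigest=never\nInitialR2T=yes\n")
host2 = parse_settings("HeaderDigest=always\nInitialR2T=yes\nContinuous=no\n")
for key, (v1, v2) in diff_settings(host1, host2).items():
    print(key, "host1:", v1, "host2:", v2)
```

A key that is missing on one side shows up with None for that host, which makes a setting changed on only one server easy to spot.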

I know that others are using either the Thecus or OpenFiler with success, but I just can't get either to get around this problem. Here are my questions before I start trying to sniff traffic to determine the root cause of the issue:

1) Does anyone have any ideas as to what would cause this or steps that they would use to troubleshoot this?

2) Can anyone provide any insight into troubleshooting iSCSI on VMware? (Logging options, etc.)

3) Can anyone using OpenFiler or a Thecus NAS share their configuration for ESX and the iSCSI target for reference in case I've missed some magic bullet?

4) Does anyone have any good resources for the flow of iSCSI traffic for reference?

Many thanks in advance!

3 Replies
JeffST
Enthusiast

Have a look at and for some possible reasons.

But you probably knew that already or don't care because it is a lab environment.

Monitoring /var/log/messages and /var/log/vmkernel can be helpful while troubleshooting
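
When the logs are busy, it can help to pull out only the storage-related lines. This is a rough Python sketch, assuming the log files have been copied to a machine with Python 3. The keyword pattern is a guess at what is relevant, not a list of actual ESX message formats, so adjust it to whatever the logs really contain.

```python
import re

# Assumed keywords; tune these to the messages that actually appear in the logs.
ISCSI_PATTERN = re.compile(r"iscsi|vmhba|reservation", re.IGNORECASE)


def iscsi_lines(lines, pattern=ISCSI_PATTERN):
    """Return the lines that match the pattern, with trailing newlines removed."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# Example use against a copied log file (the local path is a placeholder):
#
#   with open("vmkernel.log", errors="replace") as fh:
#       for line in iscsi_lines(fh):
#           print(line)
```

Running it against the logs from both hosts and lining the results up by timestamp should show what each side logged at the moment the other one connected.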

mike_laspina
Champion

The number of possibilities on this one is very high, so you need to narrow it down to one side or the other: the ESX hosts or the iSCSI targets.

Logs - check vmkwarning

Here is what I would start doing.

Shut down the ESX hosts. Install the MS iSCSI initiator on two VMs or PCs and set up the initiator names and IPs exactly as they are on the ESX hosts. Then test whether you can connect the two initiators.

If that fails, you will not need to look at ESX.
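
A lighter first step than the full initiator test is to check that each target's portal stays reachable from two machines at once. The Python sketch below only opens a TCP connection to port 3260 (the standard iSCSI port). It does not perform an iSCSI login, so it cannot replace the initiator test above, and the function name and address are placeholders.

```python
import socket


def portal_reachable(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example use (192.0.2.10 is a documentation placeholder, not a real target):
#
#   print(portal_reachable("192.0.2.10"))
```

If the portal answers from both machines but the second initiator login still knocks the first one off, the problem is more likely in the target's session handling than in the network.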

http://blog.laspina.ca/ vExpert 2009
culbeda
Contributor

As odd as it may seem, I took the two nodes out of the cluster, installed them with a fresh copy of 3.5 (same CD), opened the ports in the firewall, added the VMkernel port, enabled iSCSI, added a target for discovery and voila! Now I have a "stable" ESX cluster. (Stable enough for a lab anyway.)

Now I need to work on getting my IET boxes (OpenFiler and Thecus N5200 Pro) upgraded to IET version 0.4.15 so that they support SCSI reserve and release. (http://www.rtfm-ed.co.uk/?p=487)

Unfortunately, I still don't know exactly what caused this debacle, but I have images of both of my old servers for a post mortem. The only things I can think of are that either (A) upgrading from 3.01 to 3.02, then through various patches to 3.5 (recent CD, not the old version that would not upgrade properly) caused the problem, or (B) the order of steps I took last time caused the problem. Either way, it was INFINITELY simpler to rebuild the servers. I have learned my lesson there. ;)

Thanks again to everyone for their assistance.
