VMware Cloud Community
Box293
Enthusiast

ESX 4 SAN connectivity not working

We are in the process of upgrading from ESX 3.5 to ESX 4.

We have Dell 1955 servers and an EMC CX3-40 SAN.

Previously when connecting the ESX host to the SAN we would

  • Install the navicli/agent

  • Do the switch zoning

  • Add the host to the CX3-40 storage group

  • Perform a rescan of the HBA

  • All the datastores on the SAN would then appear

So as part of our upgrade to ESX 4 we have decided to re-build our ESX servers from scratch.

From the documentation I've read and from looking online, the navicli/agent is not required in ESX 4.

So now I am doing:

  • The switch zoning (actually it's already done so there isn't an issue here)

  • Add the host to the CX3-40 storage group

  • Perform a rescan of the HBA

  • this is as far as I get.

The rescan task on the ESX host will run ... I've waited about 20 minutes and still nothing.

The first time I did this the host wasn't connected to a vCenter server.

So when doing the re-scan, eventually I will lose connectivity to the host with the vSphere client. I can't re-connect.

I can still ping the ESX host and ssh to it.

The only option I have is to cold power the server off.

I then power it on and as it boots up it takes forever at the storage discovery part (can't remember what it said)

I've waited 20 minutes again and I still can't connect to the host using the vSphere client.

I've pressed ALT + F12 on the console and am seeing lots of stuff that has a white background with black text. Unfortunately I can't make much sense of it as it goes off the edge of the screen (it doesn't wrap to the next line).

The only way to get the host to boot up and be responsive is to disconnect the fiber cables from the ESX host.

I have tried this on two separate ESX hosts that have been built from scratch, and the same issues occur both times.

Please help.

I am unsure how to diagnose this issue, where I should start looking etc.

VCP3 32846

VSP4 VML-306798

VCP3 & VCP4 32846 VSP4 VTSP4

Accepted Solutions
dcoz
Hot Shot

Box293,

I would just double check in the SAN connectivity status that the host is showing as Logged in and registered.

If this is OK, then I also found this VMware KB article

Hope this helps.

regards

DC

9 Replies
dcoz
Hot Shot

Box293

Does the ESX host show in Navisphere as successfully registered with the SAN?

Regards

DC

Box293
Enthusiast

In Navisphere it shows as <full server name>.

All other servers using ESX 3.5 show as <full server name>

Before performing the fresh install I power off the ESX 3.5 host, remove it from the storage groups and de-register the host in the Connections. I then disconnect the fiber cables.

When I later reconnect the host appears with the full name in Connectivity Status (in the past, without the agent, it only showed the WWNs).

Box293
Enthusiast

Some additional information:

If I connect the fiber cables but DO NOT add it to the ESX storage group in navisphere, the rescan completes within about 10 seconds.

As soon as I add it to the storage group in navisphere and perform the rescan we get the issue (storage group is a collection of all the LUNs and ESX servers).

I tried calling Dell for technical support but they haven't been too helpful. Because these are fresh installs of ESX, the case falls under the "new configuration" banner instead of "supporting existing running configuration".

I have looked at some logs and found the following in the vmkernel log:

Oct 19 13:38:07 BLADE010 vmkernel: 2:21:37:52.059 cpu6:4110)VMWARE SCSI Id: Id for vmhba0:C0:T0:L26

Oct 19 13:38:07 BLADE010 vmkernel: 0x60 0x06 0x01 0x60 0x5f 0xf3 0x1a 0x00 0x72 0xf8 0xf6 0x73 0x4d 0x4d 0xdd 0x11 0x52 0x41 0x49 0x44 0x20 0x35

Oct 19 13:38:07 BLADE010 vmkernel: 2:21:37:52.059 cpu0:4096)NMP: nmp_CompleteCommandForPath: Command 0x25 (0x41000208f840) to NMP device "naa.600601605ff31a0072f8f6734d4ddd11" failed on physical path "vmhba0:C0:T1:L26" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.

Oct 19 13:38:07 BLADE010 vmkernel: 2:21:37:52.059 cpu0:4096)ScsiDeviceIO: 747: Command 0x25 to device "naa.600601605ff31a0072f8f6734d4ddd11" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.

Oct 19 13:38:07 BLADE010 vmkernel: 2:21:37:52.150 cpu0:4096)VMNIX: VmkDev: 2249: Added SCSI device vml0:2:0 (naa.600601605ff31a0072f8f6734d4ddd11)

Oct 19 13:38:16 BLADE010 vmkernel: 2:21:38:07.222 cpu6:4110)WARNING: LVM: 4223: Volume on device naa.600601605ff31a0072f8f6734d4ddd11:1 locked, possibly because remote host naa.600601605ff31a0072f8f6734d4ddd11:1 encountered an error during a volume operation and couldn't recover.

Oct 19 13:38:31 BLADE010 vmkernel: 2:21:38:21.864 cpu7:4110)WARNING: LVM: 4223: Volume on device naa.600601605ff31a0072f8f6734d4ddd11:1 locked, possibly because remote host naa.600601605ff31a0072f8f6734d4ddd11:1 encountered an error during a volume operation and couldn't recover.

Oct 19 13:38:47 BLADE010 vmkernel: 2:21:38:38.228 cpu7:4110)WARNING: LVM: 4223: Volume on device naa.600601605ff31a0072f8f6734d4ddd11:1 locked, possibly because remote host naa.600601605ff31a0072f8f6734d4ddd11:1 encountered an error during a volume operation and couldn't recover.

Oct 19 13:38:47 BLADE010 vmkernel: 2:21:38:38.309 cpu7:4110)ScsiDevice: 1757: Successfully registered device "naa.600601605ff31a0072f8f6734d4ddd11" from plugin "NMP" of type 0

This is just a small snippet.
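A log like this is easier to read once the repeated lock warnings are summarised per device. The following is a minimal sketch, assuming the warnings keep the exact "LVM: <number>: Volume on device naa...:<partition> locked" wording shown above. The function name count_lock_warnings is made up for this example.

```shell
#!/bin/sh
# Sketch: count LVM "locked" warnings per device:partition in a vmkernel log.
# count_lock_warnings is a hypothetical helper name, not a VMware tool.
count_lock_warnings() {
    # $1: path to the log file (the thread reads /var/log/vmkernel)
    # Pull out the first naa.<id>:<partition> on each warning line,
    # then count how often each one appears.
    sed -n 's/.*LVM: [0-9]*: Volume on device \(naa\.[0-9a-f]*:[0-9]*\) locked.*/\1/p' "$1" |
        sort | uniq -c | awk '{print $1, $2}'
}
```

Running count_lock_warnings /var/log/vmkernel prints one line per locked volume, with the warning count first. Lines that do not match the warning format, such as the NMP and ScsiDeviceIO entries, are ignored.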

I remember we trialed ESX4 when it was beta and we ran into some issues so we abandoned it. I have a feeling that this problem and the previous problem are linked. We thought we would wait until it was officially released as the issues it caused us were not worth being on the bleeding edge of technology.

The two ESX hosts that I am using to run up ESX4 are the same physical hosts I did the ESX4 beta testing on (with the same names and IP addresses). They have since been formatted, re-installed with ESX 3.5U4, connected to the SAN and have continued to operate fine. Not sure if ESX4 adds some locking stuff to the VMFS filesystem that needs to be manually deleted perhaps?

Any ideas?

rmagoon
Enthusiast

It could be a problem with the HBA paths enabled within your storage group. Just delete the initiator records (hosts have to be powered off, or cables disconnected). Then dismantle and delete the storage group. After that, just have HBAs login again and register them within Connectivity Status. Once complete, create a new storage group. That should clear any path problems within the storage group configuration. As a troubleshooting measure, you should see all the LUNs when you scan within the HBA firmware.

Box293
Enthusiast

I have checked the SAN connectivity status and the host has Yes in both the Logged In and Registered columns (for all 4 paths).

Thanks for pointing out this article; I am going to use it with some tips from another poster and get back with an update.

Box293
Enthusiast

I like your idea but it's a little hard for me to bring the production environment down. However this gave me another idea.

I created another storage group and added 1 LUN to it. I then assigned the ESX4 host to the group.

Performing the rescan operation now takes about 60 seconds but it does complete. The host seems slower though, and browsing the datastore takes about 60 seconds before its contents are shown.

I am going to get back to you after I've tried a tip from another poster.

Thanks

Box293
Enthusiast

It looks like we've got an answer.

By creating a new storage group and adding one LUN at a time, I am able to break the lock one LUN at a time.

My biggest problem was that when I added the ESX host to the existing storage group it would get all the LUNs at once. A rescan seemed to take about 60 seconds per LUN, so with 24 LUNs it would take ages (roughly 24 minutes).

So I would:

  • Add the LUN to the custom storage group

  • Storage Adapters - Rescan

  • Administration - System Logs - /var/log/vmkernel

  • Go to the bottom of the log and find the naa.whatever:1 text I needed

  • I copied that line into clipboard

  • I pasted into notepad

  • I copied the naa.whatever:1 text I needed

  • SSH session to the ESX host using putty

  • vmkfstools -B /vmfs/devices/disks/naa.600601605ff31a0095750c193aafdb11:1

  • Tells me it was successful

  • I go back to the vSphere client and Storage Adapters - Rescan

  • This happens instantly, problem solved.

Repeat above steps for each LUN until all are done.
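The copy-and-paste steps above could be shortened with something like the sketch below, run in the SSH session on the host. It only prints the vmkfstools -B command for the most recent locked volume in the log so it can be checked before running it by hand. The function names are hypothetical, and the pattern assumes the lock warning has the same wording as in the vmkernel log quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch: find the last locked volume in the vmkernel log and print
# (not run) the break-lock command for it. Helper names are made up.
last_locked_volume() {
    # $1: path to the log file (the thread uses /var/log/vmkernel)
    grep 'LVM: .*Volume on device naa\.' "$1" | tail -n 1 |
        sed 's/.*Volume on device \(naa\.[0-9a-f]*:[0-9]*\) locked.*/\1/'
}

breaklock_command() {
    # $1: path to the log file
    vol=$(last_locked_volume "$1")
    if [ -z "$vol" ]; then
        echo "no locked volume found in $1" >&2
        return 1
    fi
    # Dry run: show the command the poster typed, do not execute it.
    echo "vmkfstools -B /vmfs/devices/disks/$vol"
}
```

Calling breaklock_command /var/log/vmkernel after each rescan gives the line to paste. Printing rather than executing is a deliberate choice here, since breaking a lock on a volume that other hosts are using deserves a manual check first.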

Sometimes it would tell me that "Error breaking LVM device lock. Error: Busy". If I retried the command again it worked fine.
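Since a simple retry was enough to get past the "Busy" error, a small retry wrapper is one way to avoid retyping the command. This is a generic sketch, not something from the thread: the retry function is hypothetical, and it assumes the wrapped command exits with a non-zero status when it fails, which has not been verified for vmkfstools -B.

```shell
#!/bin/sh
# Sketch: run a command up to N times, pausing between attempts.
# retry is a made-up helper; it relies on the command's exit status.
retry() {
    # $1: maximum attempts, $2: seconds to wait between attempts,
    # remaining arguments: the command to run
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, retry 3 5 vmkfstools -B /vmfs/devices/disks/naa.whatever:1 would try up to three times, five seconds apart. If the command asks for confirmation, that prompt still has to be answered on each attempt.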

NOTE: All my LUNs have ESX 3.5u4 hosts connected, and during this whole procedure I did not experience any problems (there are warnings about having multiple hosts connected when performing the -B action).

If anything else pops up I'll let you know, but I think this is now solved. I'll update tomorrow after I've let it run for 12 hours.

Box293
Enthusiast

Well everything has been running OK now for about 32 hours so I think this is solved. I've added another ESX4 host and all is working OK.

Thanks very much to dcoz and rmagoon for their input. I was able to use both suggestions to come up with a way to resolve my problem.

Cheers

Troy
