TAE
Contributor

Long startup for 4.1 ESXi host connected to NetApp storage

Anyone out there running ESXi 4.1 (vSphere Hypervisor)? Anyone out there running ESXi 4.1 connected to a NetApp SAN who has their host take for-freaking-EVER to start up? Seriously - I'm pushing 40 minutes and the host is not up yet. It sat for a long time (30 minutes) at "vmw_vaaip_netapp loaded successfully", which is why I'm asking the "NetApp" question. This host has disk provisioned to it, but there are no VMs running on it currently.

By comparison, I have another ESXi 4.1 host connected to the same SAN, but no storage has yet been provisioned (unmasked) for it, so it doesn't see any LUNs. The "problem" host is one that had been running ESX 4.0 U1 and that I've reprovisioned under ESXi 4.1. It never had any guests on it, so I figured I'd save myself the time later and rebuild it now. And yes, the host was built with the fiber disconnected from the HBAs. The good news is that the host does eventually come up, participate in the cluster, and see the disk, but the amount of time it takes to start up has me baffled. I'm expecting that when storage is provisioned to the "happy" ESXi host, I'll see the same behavior.

Anyone have any thoughts on this?

23 Replies
TAE
Contributor

Just an update and a shameless bump to get this a little more visibility. I'm guessing I'll have to open a case with VMware support, but I'm sure they're getting tired of me bugging them. Anyway, I monitored a boot of a host that's connected to the back-end SAN and has disk provisioned to it. Here are the chronological events:

0:00 - Issue restart command

1:00 - POST

2:00 - Begin ESXi boot (Loading VMware Hypervisor)

3:00 - message; loading mptsas

4:00 - message; multiextent loaded successfully

- message; vmw_vaaip_netapp loaded successfully

41:00 - message; vmkibft loaded successfully

52:00 - message; cbt loaded successfully, Running usbarbitrator start

61:00 - ESXi login screen (vCenter reports server is online)

Granted, the times are approximate, based on when I noticed the message change on my alternate monitor.

Thoughts?

golddiggie
Champion

You're booting from local disks, correct? I would try reinstalling ESXi 4.1 on that host with all hardware installed, and I would wipe the drive(s) first just to make sure there are no lingering items causing issues. If you are booting from local disk(s), is it a single drive or a RAID array (such as RAID 1)? Which model server is this? Is the firmware up to date, or at the level listed for ESXi 4.1 on the HCL for that server make/model?

Network Administrator

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

ian4563
Enthusiast

I have a similar problem with ESX 4.1 and EMC Clariion storage. Every host that has ESX 4.1 installed hangs for ~30 minutes on "Starting Path Claiming and SCSI Device Discovery". The ESX 4.0 hosts are zoned into the fabric exactly the same way but don't have the issue. I was wondering if it had to do with ESX trying to determine storage acceleration for each LUN.

Andy_Banta
Hot Shot

Can you (or Ian) post any part of the messages file while this is happening?

/var/log/messages

Or the appropriate archived log, such as /var/log/messages.1.gz?

Those can show some of what's going on during that time.
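
If you have Tech Support Mode (local or remote) enabled, something along these lines will pull out the interesting bits; this is just a sketch, and the search pattern is only an example:

# current log
grep -i vaaip /var/log/messages

# compressed log from the previous boot
zcat /var/log/messages.1.gz | grep -i vaaip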

Andy

TAE
Contributor

Attached are the messages files that I pulled down from vCenter (Export System Logs) shortly after the host came online. While I'm not really sure what I'm looking for, a skim/search of the logs for "VMW_VAAIP_NETAPP" reveals a glut of logged events in the "messages.1" file. Hopefully they'll reveal something helpful.

TAE
Contributor

Good call - but no luck. I updated the machine to the current firmware, and there was no change in the start-up performance. The servers are Dell PowerEdge R710s. I'm hesitant to re-install with the HBAs connected to the SAN, as it's totally against the VMware recommendation. All the hardware was in place when I ran the installation; the HBAs were not added afterwards, for example.

Regarding your other points, yes: Local disk, RAID1. All hardware is on the HCL and I cannot find any footnotes that would prove to be a "gotcha" in terms of version.

I've attached the "messages" logs to this thread which may prove helpful.

isokolin
Contributor

What is the ONTAP version on your NetApp array?

I looked at the logs you provided, and I see the following:

1) You have a fairly large number (22?) of NetApp LUNs mapped to this ESX host.

2) When claiming (discovering) LUNs, ESX 4.1 now issues several new commands compared to ESX 4.0.x.

One of these new commands is MODE_SENSE (0x1a) of page 0x0a (control mode), which is used to detect the QErr parameter.

It looks like this command fails after a 40-second timeout.

Aug 4 21:13:37 vmkernel:

Aug 4 21:13:37 vmkernel: 0:00:01:13.312 cpu9:4724)VMWARE SCSI Id: Id for vmhba3:C0:T0:L99

Aug 4 21:13:37 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x56 0x4a 0x57 0x33 0x4a 0x64 0x44 0x67 0x4c 0x55 0x4e 0x20 0x20 0x20

Aug 4 21:14:17 vmkernel: 0:00:01:53.354 cpu13:4109)ScsiDeviceIO: 1672: Command 0x1a to device "naa.60a98000572d504c564a57334a644467" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Aug 4 21:14:17 vmkernel: 0:00:01:53.354 cpu9:4724)ScsiDeviceIO: 4494: Could not detect setting of QErr for device naa.60a98000572d504c564a57334a644467. Error Failure.

So if you have about 22 LUNs, and you get a 40-second timeout for every LUN, it will take you 14 minutes (see the rough math at the end of this post).

3) In addition, in some cases the READ_CAPACITY (0x25) command also fails after a 40-second timeout, adding to the overall boot time.

Aug 4 21:26:18 vmkernel: 0:00:13:54.198 cpu8:4724)VMW_VAAIP_NETAPP: netapp_claim_device: Device naa.60a98000572d504c654a566c75522d43 (NETAPP :LUN ) not supported

Aug 4 21:26:58 vmkernel: 0:00:14:34.238 cpu14:4110)ScsiDeviceIO: 1672: Command 0x25 to device "naa.60a98000572d504c654a566c75522d43" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Aug 4 21:26:58 vmkernel: 0:00:14:34.238 cpu8:4724)WARNING: ScsiCore: 1399: Invalid sense buffer: error=0x0, valid=0x0, segment=0x0, key=0x0

We have not seen such behavior with the NetApp ONTAP versions that we are testing with.

Therefore, I would suggest that you consider upgrading the NetApp ONTAP firmware.
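
For a quick back-of-the-envelope check of the numbers above (plain shell arithmetic; 22 LUNs and the 40-second timeout are simply what shows up in your logs):

# 22 LUNs x one failing command (MODE_SENSE) x 40 s timeout
echo $(( 22 * 40 / 60 )) minutes    # ~14 minutes

# if READ_CAPACITY also times out on each LUN, roughly double that
echo $(( 22 * 2 * 40 / 60 )) minutes    # ~29 minutes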

TAE
Contributor

We are running Data ONTAP version 7.3.2P4 on a FAS3140. Based on my research through the HCLs, this is a supported level. Also, if it matters, I'm using QLogic QLE2640 FC HBAs to connect to the storage.

Regarding these "new commands", is it a fair assumption that READ_CAPACITY is also one of them? It certainly makes sense that if these commands are timing out after a fair delay, that would explain why it's taking so long to restart.

Also, you state that 22 LUNs is "fairly large"? I was under the impression that each ESX host had an upper limit of 255 LUNs mapped. I'm only at 10% of the maximum, so I'm a little confused over this comment.

I certainly don't mean to sound confrontational on this matter, but I put a good amount of time into making sure everything was supported before I recommended that we move to vSphere Hypervisor 4.1. If I go back to the storage guys (and my manager) and tell them that they need to update Data ONTAP to a new version, I'm going to look like quite the idiot. (Well, more so than normal.)

What versions of Data ONTAP have you tested with?

isokolin
Contributor

1) READ_CAPACITY has been there forever, which makes this even stranger.

2) 22 LUNs is perfectly normal. All I meant to say is that if you had only, say, 3 LUNs, the extra boot time would be only 5-6 minutes.

3) ONTAP 7.3 should be fine, but I don't know anything about particular patch releases.

In any case, you need to take this info (logs, my comments, the ONTAP version, and better yet a vm-support bundle) to VMware support, and maybe NetApp support as well.

This seems to be a real interoperability issue between ESX 4.1 and at least this particular version of ONTAP, and it has to go through the proper support channels.

TAE
Contributor

Will do. I'll post any relevant results here.

ilion1
Contributor

Hello,

I just ran into a similar issue. Anything after "CBT loaded successfully" was very slow. After I rebooted the NAS system providing the iSCSI storage, the ESXi host startup was normal again. Upon the next reboot, however, the ESXi host would not reconnect to the storage, even though the storage system showed an active connection from the host. Eventually I noticed that the iSCSI connection stays active when the ESXi host is shut down or rebooted, which means that after a reboot the ESXi host times out when trying to reconnect. This causes the delay at "CBT loaded successfully" and everything after it. Whenever I manually disconnect the connection on the storage system after the host is shut down, the next start is fine.

To me it seems there is an error in the shutdown process of ESXi 4.1: it does not disconnect from iSCSI storage properly.

RParker
Immortal

This means that a LUN may not be reading correctly. It could be the host. Try a rescan of ALL HBAs on that host (just click the host and rescan; it will do them all at once).
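
If you prefer to do it from the console, the per-adapter equivalent should be roughly the following (the adapter names are just examples taken from the logs in this thread):

esxcfg-rescan vmhba2
esxcfg-rescan vmhba3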

If that doesn't work, since you have shared storage, migrate the VMs and templates off and rebuild that host from scratch (delete the local disk, recreate, etc.).

RParker
Immortal

http://communities.vmware.com/thread/278792?tstart=0

It appears others are seeing this as well.

TAE
Contributor

Since there's been some recent activity, here's a quick update with my research thus far:

- opened a case with VMware and NetApp; no glaring "oh, here's your problem, just do this" which is a bit frustrating.

- NetApp confirmed that the LUNs are connected properly.

- VMware noted that the delay was only on specific LUNs, which accounts for the long start-up time.

- referencing the LUNs on my side of the fence, the problem only seems to be affecting the LUNs that are used as RDMs that are part of a Microsoft cluster. What's more, all these MSCS RDMs are hosted on ESX (not ESXi) servers running 4.0U1.

- I've not tried moving the MSCS guests to the new ESXi hosts.

- I've not tried masking the LUNs from the affected servers to see if startup times improve

- I do have stand-alone servers using RDMs, and it does not seem that they are throwing errors.

- On the ESXi 4.1 servers affected by this problem, vCenter is showing that the LUN is reporting a capacity of "0 bytes"

- I have one ESX 4.0U1 server that is actually demonstrating the same problem.

- Two ESX 4.0U1 servers are demonstrating the "0 byte" problem on a single LUN.

In vCenter, when checking the storage array type, it is set to "VMW_SATP_ALUA"; this is true with both the "Most Recently Used (VMware)" and "Round Robin (VMware)" path policies. I do notice, however, that ESXi 4.1 offers another path selection option, "VMW_PSP_FIXED_AP".
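
As a side note, you can also dump the SATP/PSP assignment per device from the console. On 4.x I believe the command looks like this (hedged from memory; the naa ID is one of the devices from the logs in this thread):

# list every device with its storage array type plugin and path selection policy
esxcli nmp device list

# or just the one device in question
esxcli nmp device list -d naa.60a98000572d504c654a573236636b6c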

Below is a snip from the message log demonstrating what I believe is the problem (sorry it's so ugly):

11-Aug 14:25:29 vmkernel: 0:00:03:16.553 cpu3:4724)vmw_psp_mru: psp_mruSelectPathToActivateInt: Changing active path from NONE to vmhba2:C0:T0:L98 for device "Unregistered Device".

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel: 0:00:03:16.553 cpu3:4724)VMWARE SCSI Id: Id for vmhba2:C0:T1:L98

11-Aug 14:25:29 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x57 0x32 0x36 0x63 0x6b 0x6c 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel: 0:00:03:16.554 cpu3:4724)VMWARE SCSI Id: Id for vmhba2:C0:T0:L98

11-Aug 14:25:29 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x57 0x32 0x36 0x63 0x6b 0x6c 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel: 0:00:03:16.554 cpu3:4724)VMWARE SCSI Id: Id for vmhba3:C0:T1:L98

11-Aug 14:25:29 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x57 0x32 0x36 0x63 0x6b 0x6c 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel:

11-Aug 14:25:29 vmkernel: 0:00:03:16.554 cpu3:4724)VMWARE SCSI Id: Id for vmhba3:C0:T0:L98

11-Aug 14:25:29 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x57 0x32 0x36 0x63 0x6b 0x6c 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:26:09 vmkernel: 0:00:03:56.569 cpu0:4112)ScsiDeviceIO: 1672: Command 0x1a to device "naa.60a98000572d504c654a573236636b6c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

11-Aug 14:26:09 vmkernel: 0:00:03:56.570 cpu3:4724)ScsiDeviceIO: 4494: Could not detect setting of QErr for device naa.60a98000572d504c654a573236636b6c. Error Failure.

11-Aug 14:26:09 vmkernel: 0:00:03:56.570 cpu3:4724)VMW_VAAIP_NETAPP: netapp_claim_device: Device naa.60a98000572d504c654a573236636b6c (NETAPP :LUN ) not supported

11-Aug 14:26:49 vmkernel: 0:00:04:36.572 cpu0:4112)ScsiDeviceIO: 1672: Command 0x25 to device "naa.60a98000572d504c654a573236636b6c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

11-Aug 14:26:49 vmkernel: 0:00:04:36.572 cpu3:4724)WARNING: ScsiCore: 1399: Invalid sense buffer: error=0x0, valid=0x0, segment=0x0, key=0x0

11-Aug 14:27:15 vmkernel: 0:00:05:02.440 cpu8:4228)VMW_VAAIP_NETAPP: netapp_claim_device: Device naa.60a98000572d504c564a57334a644467 (NETAPP :LUN ) not supported

11-Aug 14:27:15 vmkernel: 0:00:05:02.440 cpu8:4228)VMW_VAAIP_NETAPP: netapp_claim_device: Device naa.60a98000572d504c654a573236636b6c (NETAPP :LUN ) not supported

11-Aug 14:27:29 vmkernel: 0:00:05:16.605 cpu15:4111)ScsiDeviceIO: 1672: Command 0x1a to device "naa.60a98000572d504c654a573236636b6c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

11-Aug 14:27:29 vmkernel: 0:00:05:16.605 cpu3:4724)WARNING: ScsiDeviceIO: 5172: READ CAPACITY on device "naa.60a98000572d504c654a573236636b6c" from Plugin "NMP" failed. I/O error

11-Aug 14:27:29 vmkernel: 0:00:05:16.605 cpu3:4724)WARNING: ScsiDevice: 1726: Full GetDeviceAttributes during registration of device 'naa.60a98000572d504c654a573236636b6c': failed with I/O error

11-Aug 14:27:29 vmkernel: 0:00:05:16.605 cpu3:4724)ScsiDevice: 2200: Successfully registered device "naa.60a98000572d504c654a573236636b6c" from plugin "NMP" of type 0

And a snip for a LUN that doesn't seem to throw any errors, but is also mapped as an RDM to a standalone server:

11-Aug 14:43:30 vmkernel: 0:00:21:17.924 cpu0:4724)vmw_psp_mru: psp_mruSelectPathToActivateInt: Changing active path from NONE to vmhba2:C0:T0:L108 for device "Unregistered Device".

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel: 0:00:21:17.924 cpu0:4724)VMWARE SCSI Id: Id for vmhba3:C0:T1:L108

11-Aug 14:43:30 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x58 0x49 0x67 0x32 0x6c 0x44 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel: 0:00:21:17.925 cpu0:4724)VMWARE SCSI Id: Id for vmhba2:C0:T0:L108

11-Aug 14:43:30 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x58 0x49 0x67 0x32 0x6c 0x44 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel: 0:00:21:17.925 cpu0:4724)VMWARE SCSI Id: Id for vmhba3:C0:T0:L108

11-Aug 14:43:30 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x58 0x49 0x67 0x32 0x6c 0x44 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel:

11-Aug 14:43:30 vmkernel: 0:00:21:17.925 cpu0:4724)VMWARE SCSI Id: Id for vmhba2:C0:T1:L108

11-Aug 14:43:30 vmkernel: 0x60 0xa9 0x80 0x00 0x57 0x2d 0x50 0x4c 0x65 0x4a 0x58 0x49 0x67 0x32 0x6c 0x44 0x4c 0x55 0x4e 0x20 0x20 0x20

11-Aug 14:43:30 vmkernel: 0:00:21:17.925 cpu0:4724)ScsiDeviceIO: 4510: QErr is correctly set to 0x0 for device naa.60a98000572d504c654a584967326c44.

11-Aug 14:43:30 vmkernel: 0:00:21:17.926 cpu0:4724)VMW_VAAIP_NETAPP: netapp_claim_device: Device naa.60a98000572d504c654a584967326c44 (NETAPP :LUN ) not supported

11-Aug 14:43:30 vmkernel: 0:00:21:17.933 cpu0:4724)ScsiDevice: 2200: Successfully registered device "naa.60a98000572d504c654a584967326c44" from plugin "NMP" of type 0

More to come, I'm sure.

isokolin
Contributor

Aha, these are MSCS RDMs ...

The MSCS master node keeps these LUNs in a SCSI-reserved state at all times.

Therefore, other nodes cannot execute commands like READ_CAPACITY and MODE_SENSE.

This problem exists in ESX 4.0 also.

This is not array specific, and there is no difference between ESX and ESXi in this respect.

I am not sure what can be done here, except for masking the MSCS LUNs from the ESX hosts that do not need to see them (one ESX-side option is sketched at the end of this post).

Ask VMware support if there are any special settings for MSCS.
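
If you do go the masking route and would rather handle it on the ESX side than on the array, MASK_PATH claim rules are one option. A rough sketch of the 4.x esxcli syntax from memory (the rule number, adapter, and LUN below are only placeholders pulled from this thread; check the VMware documentation before using it, and repeat the add for each vmhba that can reach the LUN):

# hand the MSCS RDM path over to the MASK_PATH plugin
esxcli corestorage claimrule add --rule 500 --type location --adapter vmhba2 --channel 0 --target 0 --lun 98 --plugin MASK_PATH

# load the new rule set and verify
esxcli corestorage claimrule load
esxcli corestorage claimrule list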

whynotq
Commander

I have seen long startup issues many times. They are usually associated with FS locking, and normally MSCS appears in the configuration somewhere. What I have had to do in the past is move the MSCS VMs to a separate Storage Group (EMC); I guess this relates to an igroup on the NetApp. Once the MSCS LUNs were removed from visibility of my production cluster, the boot time returned to normal.

The only other time I have seen similar issues is with a bus that has no LUN 0 presented; in fact, I am looking into this currently at a NetApp site, which is what brought me to your post. What I see in some instances is that the controller has a target but no storage bound. In FC this is overcome by the pseudo device the storage vendor presents, which maps as LUN 0, but with my iSCSI setup here that is not the case, and it may not be for you either. I am not 100% sure this is my problem (or yours), but it is something I am currently working to eliminate from my issues. Any idea how I would present a LUN with an ID of, say, 90 as LUN 0? It's easy in EMC (ALU-HLU); is there a cross mapping in NetApp?

TAE
Contributor

Well, it's been determined that the slow startup of ESXi in my environment all ties back to MSCS and the RDMs used for its shared disk. Since these LUNs are locked from access, ESXi tries to get information from them for about 180 seconds, times out, and moves on. Apparently this is a new "feature" in version 4.1, and there's even a "known issue" associated with it: (source)

Persistent reservation conflicts on shared LUNs might cause ESX and ESXi hosts to take longer to boot

You might experience significant delays while booting hosts that share LUNs on a SAN. This could be because of conflicts between the LUN SCSI reservations.

Workaround: To resolve this issue and speed up the boot process, change the timeout for synchronous commands during boot time to 10 seconds. You can do this by setting the Scsi.CRTimeoutDuringBoot parameter to 10000.

To modify the parameter from the vSphere Client:

1. In the vSphere Client inventory panel, select the host, click the Configuration tab, and click Advanced Settings under Software.

2. Select SCSI.

3. Change the Scsi.CRTimeoutDuringBoot value to 10000.
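
If you would rather set it from the Tech Support Mode console, the equivalent should be something like this (syntax from memory, so verify on a test host first):

# check the current value
esxcfg-advcfg -g /Scsi/CRTimeoutDuringBoot

# set the boot-time timeout to 10 seconds (10000 ms)
esxcfg-advcfg -s 10000 /Scsi/CRTimeoutDuringBoot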

I have successfully tested values of 10000 and 5000 (the current setting), and in both cases the startup time of the ESXi 4.1 host improved. With the current setting, the boot only (only!) takes 30 minutes before vCenter reports the host as available. Feedback from support indicated that this setting applies only during the boot process and is not referenced after the host has fully started. Setting the value too low may result in LUNs not being properly detected and therefore not being visible in the environment.

This option - Scsi.CRTimeoutDuringBoot - does not exist in 4.0 U1, and probably not in 4.0 U2 either. The response I received from support is that this is not listed as a bug and will probably become a KB article. Personally, I find this a little odd, since a 30-minute boot process isn't what I would consider acceptable. This may be a means by which VMware discourages the use of MSCS with vSphere, pushing users more toward FT: you're still welcome to use MSCS, but there's a price. Of course, the other alternative would be to mask the LUNs used by MSCS from specific hosts and limit where you place your guest machines.

Anyway, I'm going to mark this question answered and hope it helps someone else. Thanks to everyone for their feedback.

dcsteam
Contributor

Just to let everyone who commented on this thread know, this is now being recognised by VMware as a bug (they have a PR open with multiple SRs attached).

We have seen the exact same problems with hosts running MSCS clusters with physical mode RDMs, but our experience has been far worse than those reported on this thread.  One of our hosts still hadn't booted after 9 hours when tested the other day!

Currently they are talking about a fix in Q3 2011 (!), but we are pressing them on that, as for us that's just unacceptable.

TAE
Contributor

Thanks for the update!  When I had initially opened this case with VMware, the engineer was saying that this would likely be nothing more than a KB article.  Glad to hear that they've had a change of heart/opinion.  Doesn't do much for you, unfortunately...  9 hour boot.  Owie.

:(
