Yeah, that's been an issue forever; I've had this since ESX 3.
The resolution is in that KB: just lower the retries and timeout. It helps, but it's still painful. In reality, though, you shouldn't be rebooting too often, so it doesn't matter much.
What I did, being in a large organisation, was build a purpose-built MSCS ESX cluster so that only a few hosts were affected, and everything else sits on the main corporate cluster.
Don't know if having a support case open will accomplish anything for this problem.
Still waiting to hear from VMware Support.
I know we don't need to reboot the ESX hosts that often once they are loaded and in service. But think about the time it takes to load if you have 500+ hosts to upgrade to ESXi 5.
I do have MSCS isolated to a few clusters only.
Have you tried to set the parameter mentioned in the KB?
As AARCO mentioned above, those advanced options are not available in 5.0.
I see the same hang during boot under vSphere 5i on Cisco UCS blades, HP DL360 G7s, and nested vSphere 5i instances - all connecting to iSCSI devices. In my situation some of our iSCSI SANs are no longer on the HCL for vSphere 5, so when I tried to raise the issue with VMware support, the help was quite limited beyond confirming that my iSCSI configuration was correct. I can reproduce the issue on a nested ESX5i instance hooking up to a NexentaStor device. I suspect this is a generic vSphere 5i issue - VMware, please can you look at this? We are running these iSCSI devices: HP MSA2000, HP MSA 2012i, NexentaStor. Boot time varies between 10 minutes and one hour depending on the configuration.
For ESX 5.0:
On the ESX hosts that are running MSCS VMs, identify LUNs exported as RDMs to VMs
For each LUN identified above, perform this configuration from the ESX command line:
esxcli storage core device setconfig -d naa.<lunid> --perennially-reserved yes
The subsequent ESX reboot should no longer be slow. KB 1016106 will be updated ASAP with this information.
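The two steps above can be scripted for hosts with several RDM LUNs; a minimal sketch, assuming you have already collected the naa IDs of the MSCS RDM LUNs (the IDs below are placeholders to substitute):

```shell
# Placeholder naa IDs for the RDM LUNs used by the MSCS VMs - substitute your own
RDM_LUNS="naa.<lunid1> naa.<lunid2>"

for LUN in $RDM_LUNS; do
  # Mark the LUN as perennially reserved so the host skips it during boot-time scans
  esxcli storage core device setconfig -d "$LUN" --perennially-reserved yes
done
```

Note that the setting is per-host, so it would need to be applied on every ESX host that sees these RDM LUNs.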
@ashleyw: doesn't look like you are running MS Failover Clustering, are you? Since VMware doesn't support MSCS over iSCSI, your slow boot problem looks unrelated to MSCS.
Could you please run these on the ESX command shell:
~# cd /var/run/log
~# fgrep '0xb 0x24 0x' vmkernel.log
~# for i in vmkern*gz; do gzip -cd $i | fgrep '0xb 0x24 0x' ; done
If it turns up a bunch of matches, we know this issue exists with a number of iSCSI targets (a target bug, not ESX).
If not, please open an SR; or just give me the SR id if you already provided vmware with full support logs.
Thanks. I will try this out and post the results later.
@kchowksey: no I'm not running MS Failover Clustering.
When I run the fgrep command it doesn't find anything. The vmkernel logs have not been gzipped yet, so there are no vmkern*gz files.
The case number I attached the log files to is 11096075809.
I've attached the log file from our nested ESX5i host that shows the same "hang" at boot time when connecting only to a NexentaStor box via iSCSI - the "hang" in this situation lasts around 4 minutes. Interestingly, I see a lot of "Network is unreachable" and "iscsid: Login Failed" errors even though there are no issues with the connectivity - I see the same type of messages on our production farm as well.
update on 14/09/2011 18:45: I have removed the log file to avoid confusion - see below.
Thanks ashley. Have forwarded your report to the right people. Suggest contacting Nexenta support too.
thanks for your help. To eliminate as much garbage as possible from the log files (as I may have appended some incorrect information), I cleared all logs and then rebooted - it took around 6 minutes on the nested ESXi box... the bulk of the time was spent during the iSCSI phase, after the vmw_satp_alua loaded successfully message on the console. On a UCS blade this process takes around 15 minutes; on a DL360 G7 it takes around 30 minutes - see below.
I've summarised the logs as a single small attachment.
When I look closely at the vmkernel.log file, I see the bulk of the time is spent in this section:
2011-09-14T04:50:40.503Z cpu0:2604)ScsiDevice: 3121: Successfully registered device "naa.600144f02aa50c0000004e640a430001" from plugin "NMP" of type 0
2011-09-14T04:50:40.524Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
2011-09-14T04:50:40.555Z cpu0:2604)VC: 1449: Device rescan time 20 msec (total number of devices 5)
2011-09-14T04:50:40.555Z cpu0:2604)VC: 1452: Filesystem probe time 29 msec (devices probed 5 of 5)
2011-09-14T04:50:43.471Z cpu0:2050)LVM: 13188: One or more LVM devices have been discovered.
2011-09-14T04:51:06.754Z cpu1:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
2011-09-14T04:51:06.775Z cpu1:2604)VC: 1449: Device rescan time 22 msec (total number of devices 5)
2011-09-14T04:51:06.775Z cpu1:2604)VC: 1452: Filesystem probe time 19 msec (devices probed 5 of 5)
2011-09-14T04:51:32.987Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
For some reason, it looks like it is repeatedly trying to access vmhba32, which appears to be the controller the CD-ROM device hangs off. Sigh...
I guess this is a bug in vSphere 5? Please advise.
vmwarelogs.txt 29.1 K
I managed to make a little progress on this today, to the point where the host rescan times at least have come down to a minute. Thanks to @kchowksey for some good suggestions. I noticed that my QNAP was being picked up as an ALUA array, in addition to the failed I/O with sense data 0xb 0x24 0x0.
The claim rule I applied was as follows:
esxcli nmp satp rule add -d "<naa.deviceid>" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"
From what I can tell this problem impacts QNAP and Netgear. I've also got OpenFiler, and it didn't appear to be affected, though I have done only limited testing. Note that none of these storage systems are currently on the HCL. I believe the cause is that the iSCSI targets do not implement the T10 standards correctly. I'm going to be working with VMware support on this as well. So far the only iSCSI storage I've got that works is the HP P4000 (aka Lefthand Networks) VSAs with SAN/iQ 9.x.
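For anyone checking how their own array is being claimed, the standard ESXi 5.x esxcli namespaces can show the SATP in use before and after adding a rule (the device ID below is a placeholder):

```shell
# Show which SATP and path selection policy currently claim the device
esxcli storage nmp device list -d naa.<deviceid>

# List all SATP claim rules, to confirm a newly added rule is present
esxcli storage nmp satp rule list
```

Claim rules are evaluated at claim time, so a reboot (or unclaiming and reclaiming the device) is typically needed before a new rule takes effect on an already-claimed device.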
For iSCSI access from vSphere 5 hosts, during discovery the host will try to reach every target from every vmkernel port that is bound to the initiator. It will retry a number of times for each combination before finally giving up and moving on.
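That multiplicative effect is worth spelling out; a back-of-the-envelope sketch with entirely hypothetical numbers (the real retry count and per-attempt timeout are not documented here):

```shell
# All values are assumptions for illustration only
TARGETS=4        # unreachable iSCSI targets discovered
BOUND_PORTS=2    # vmkernel ports bound to the software iSCSI initiator
RETRIES=9        # login attempts per target/port combination
TIMEOUT=15       # seconds per failed attempt

# Worst case: every combination exhausts its retries serially
echo "worst-case delay: $((TARGETS * BOUND_PORTS * RETRIES * TIMEOUT)) seconds"
```

With these made-up numbers the worst case is 1080 seconds, which is in the same ballpark as the 10-minute-to-an-hour boot times reported above - and it grows with every extra target or bound port.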