I am seeing boot times in excess of 20 minutes. The boot process hangs just after loading "vmw_vaaip_hds" for over 12 minutes. I'm not sure whether, after loading vmw_vaaip_hds, it is executing some activity based on that module, or whether it is trying to load the next service, which is "gss". I haven't been able to find details on what gss is. Other than this one hang, the boot is fairly smooth.
I have read this may be because of the use of RDMs, but I'm not completely convinced.
Can anyone shed light on what could be causing vSphere to take so long to boot, and whether and how it can be mitigated?
Was the boot time always this slow or has it slowed down recently?
If it's a new issue in your environment, then it's most likely caused by RDMs. You can follow the guide in the VMware Knowledge Base to troubleshoot this.
When the process hangs at "vmw_vaaip_hds", it means boot has stopped at that step or the next step is still in progress. You can press Alt+F12 at the DCUI console to watch the live vmkernel log for the current activity. Most of the time the boot is stuck in the storage scan process, which is explained in KB 1016106. I am sure the live vmkernel log will give you some of the answers you are looking for on the slow boot issue.
Thanks,
MS
OK, I looked at the KB - horribly written, but clearly pointing to applying a setting to the LUN to mark it "--perennially-reserved=true".
I say the KB is horribly written because it doesn't really explain what this setting does, or what other ramifications might exist as a result of changing it to --perennially-reserved=true.
Before I apply this change to a production environment, can you shed some light on it? What is it doing during boot, and what might the ramifications be once the system is running live?
Also, correct me if I'm wrong, but it looks as though this command needs to be run on every host in the cluster, and for every RDM - is that the case?
Just tested our older hosts that see the same storage; the boot time is the same - very long, about 20 minutes. (Current prod is running on 5.5. We just built out a new cluster on 6.5 and are getting ready to migrate all workloads to it; both clusters see the same LUNs at this time until migrations are completed.)
Followed link to KB, see below.
Yes. The changes should be made on all hosts and for all the RDMs. You might also consider upgrading the drivers/firmware on the ESXi hosts for the HBAs or NICs (if using iSCSI) as per the compatibility guide.
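For anyone scripting this across hosts: a minimal sketch of generating the per-device esxcli lines the KB describes (the NAA IDs below are placeholders - substitute the device IDs of your own RDMs):

```python
# Build the esxcli commands that mark each RDM LUN as perennially reserved,
# per KB 1016106. NAA IDs here are placeholders, not real devices.

def build_commands(naa_ids):
    """Return one esxcli setconfig command per RDM device ID."""
    template = ("esxcli storage core device setconfig "
                "-d {} --perennially-reserved=true")
    return [template.format(naa) for naa in naa_ids]

rdm_ids = [
    "naa.60060e80056f110000006f1100000001",
    "naa.60060e80056f110000006f1100000002",
]

for cmd in build_commands(rdm_ids):
    print(cmd)
```

Each emitted line is then run against every host in the cluster (the setting is per-host, per-device, and survives reboots).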
Thanks,
MS
I just ran a script to collect RDM info
added the collected info into a text file and compiled a string of esxcli commands
ran the commands in the vSphere CLI against a host that doesn't run any VMs but does see the LUNs used as RDMs, since it belongs to the same storage group as all hosts in the cluster
everything was successful after dealing with the thumbprint issue when connecting to the host
rebooted - wow the boot was so fast my head is still spinning
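In case it helps anyone repeating this: the setting can be verified afterwards with `esxcli storage core device list -d <naa.id>`, whose output includes an "Is Perennially Reserved" line. A small sketch that checks captured output (the sample text below is illustrative, not from a real host):

```python
# Check the "Is Perennially Reserved" flag in captured output from
# `esxcli storage core device list`. Sample text is illustrative only.

def is_perennially_reserved(device_listing):
    """Return True if the listing reports 'Is Perennially Reserved: true'."""
    for line in device_listing.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Is Perennially Reserved":
            return value.strip().lower() == "true"
    return False

sample = """\
naa.60060e80056f110000006f1100000001
   Display Name: Fibre Channel Disk (naa.60060e80056f110000006f1100000001)
   Is Perennially Reserved: true
"""

print(is_perennially_reserved(sample))  # -> True
```

Handy for confirming the change took effect on every host before the next maintenance-window reboot.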
I will run this across all of my new hosts, but I am not going to run it against the older hosts where the VMs currently run that own the RDMs.
Once all the VMs are migrated to the new cluster, will there be any impact from the perennially-reserved setting on the VMs that own the RDMs, or on the hosts that house those VMs?
Thanks