Does anyone know if it's normal for VSAN: Initializing SSD: to take a very long time? It takes at least 10 minutes when the host is rebooted to get through this step.
My hosts stop here for about 5-10 minutes too (usually 5 minutes), but it does take longer sometimes. If you're at the console and press alt-f12 while it is at this step, it is doing stuff in the background so I think it's normal.
Same here. A 200GB SAS SSD takes about 5 to 6 minutes.
Update August 12th 2017: just upgraded an 8-node cluster from 6.5 to 6.5 U1 and the init time is different on every node. Some were done initialising after 5 to 6 minutes; others took 10 or even up to 25 minutes. All nodes are identical.
Does anyone know what it is actually doing?
I'm lucky I found this thread. I have noticed a similar behavior with vSAN on DL380 gen9 AF ready nodes. Initialization takes about 6-8 mins in my case.
Looks normal from what I've read.
Yes this is normal. The KB article mentioned above should be helpful.
Just some additional info:
This behaviour is also documented in "Essential Virtual SAN (VSAN): Administrator's Guide to VMware Virtual SAN" by Cormac Hogan and Duncan Epping.
The VSAN Local Log Structured Object Manager (LSOM) works at the physical disk level. It is the LSOM that provides for the storage of VM storage object components on the local disks of the ESXi hosts, and it includes both the read caching and write buffering for these objects. When we talk in terms of components, we are talking about one of the striped components that make up a RAID-0 configuration, or one of the replicas that makes up a RAID-1 configuration. Therefore, LSOM works with the magnetic disks and solid-state disks (SSDs) on the ESXi hosts. To recap, the SSDs are used as a cache and a nonvolatile write buffer in front of the magnetic disks.
Another way of describing the LSOM is to state that it is responsible for providing persistence of storage for the VSAN cluster. By this, we mean that it stores the components that make up VM storage objects as well as any configuration information and the VM storage policy.
LSOM reports events for these devices, for example, if a device has become unhealthy. The LSOM is also responsible for retrying I/O if transient device errors occur.
LSOM also aids in the recovery of objects. On every ESXi host boot, LSOM performs an SSD log recovery. This entails a read of the entire log that ensures that the in-memory state is up to date and correct. This means that a reboot of an ESXi host that is participating in a VSAN cluster can take longer than an ESXi host that is not participating in a VSAN cluster.
Just to add to what vpradeep01 mentioned from that book:
This log data is not flushed to the capacity tier on host reboot, so the log recovery time (and therefore the reboot time) correlates with the amount of log data on the cache device. My understanding is that this data also needs to be compared with the current state of the data, because data keeps changing on the rest of the cluster while the node is absent. So if the cluster has a high rate of change, SSD initialization will take longer (as will unsupported things such as increasing the default high and low log-capacity levels).
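For anyone who wants a feel for why the amount of log data matters, here is a toy Python model of replaying a log at boot. It is only a sketch of generic log recovery, not how LSOM is actually implemented: `LogEntry`, `recover`, and the log sizes are all made up for illustration. The point is simply that every entry has to be read before the in-memory state is known to be correct, so the work grows with the length of the log.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    # Hypothetical record of one buffered write on the cache device.
    block: int    # logical block the write targeted
    data: bytes   # payload that was buffered


def recover(log):
    """Replay every log entry in order to rebuild an in-memory block map.

    Returns the rebuilt map and the number of entries that had to be read.
    """
    state = {}
    entries_read = 0
    for entry in log:
        state[entry.block] = entry.data  # later writes overwrite earlier ones
        entries_read += 1
    return state, entries_read


# A nearly empty log versus a log holding many buffered writes.
small_log = [LogEntry(block=i % 4, data=b"x") for i in range(10)]
large_log = [LogEntry(block=i % 4, data=b"y") for i in range(10_000)]

small_state, small_reads = recover(small_log)
large_state, large_reads = recover(large_log)

print(small_reads, large_reads)            # 10 10000
print(len(small_state), len(large_state))  # 4 4
```

Both logs end up describing the same four blocks, but the larger one needs a thousand times more reads to get there. That would fit identical nodes taking very different times to initialise, depending on how much log data each one held when it went down.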