VMware Cloud Community
MichaelGi
Enthusiast

Initializing SSD takes a very long time when restarting host

Does anyone know if it's normal for the "VSAN: Initializing SSD:" step to take a very long time? It takes at least 10 minutes to get through this step when the host is rebooted.

8 Replies
elerium
Hot Shot

My hosts stop here for about 5-10 minutes too (usually 5 minutes), but it does take longer sometimes. If you're at the console and press Alt-F12 while it is at this step, you can see that it is doing work in the background, so I think it's normal.
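
If SSH or the ESXi Shell is available once the host is back up, you can also review what was logged during that phase. A minimal sketch (the exact message tags are an assumption on my part and vary by release; LSOM and PLOG are the vSAN disk-layer components):

grep -iE 'LSOM|PLOG' /var/log/vmkernel.log

To watch it live from a shell session, tail -f /var/log/vmkernel.log shows the same vmkernel log that Alt-F12 displays on the console.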

zdickinson
Expert

Good afternoon. I can confirm what elerium sees; it takes around 10 minutes for this step. We have two 700 GB PCIe SSDs in each host. Thank you, Zach.

srodenburg
Expert

Same here. A 200 GB SAS SSD takes about 5 to 6 minutes.

Update, August 12th 2017: I just upgraded an 8-node cluster from 6.5 to 6.5 U1, and the initialization time on every node is different. Some were done initializing after, say, 5 to 6 minutes; others took 10 or even up to 25 minutes. All nodes are identical.

Does anyone know what it is actually doing?

vBlackCat
Contributor

I imagine it is never too late to post an answer, so YES, this is normal, as per this KB: https://kb.vmware.com/s/article/2149115


And yes, this is really annoying when you have to update a cluster, but I think we'll have to deal with it.

kabanossi
Enthusiast

I'm lucky I found this thread. I have noticed similar behavior with vSAN on DL380 Gen9 all-flash ready nodes. Initialization takes about 6-8 minutes in my case.

Looks normal from what I read.

arjanhs
Enthusiast

It takes more than 30 minutes per host over here, so when updating a driver and restarting, say, 20 hosts, it takes 10 hours before the task is done.

vpradeep01
VMware Employee

Yes, this is normal. The KB article mentioned above should be helpful.

Just some additional info:

This behaviour is also documented in "Essential Virtual SAN (VSAN): Administrator's Guide to VMware Virtual SAN" by Cormac Hogan and Duncan Epping.

Page 58

Component Management

The VSAN Local Log Structured Object Manager (LSOM) works at the physical disk level. It is the LSOM that provides for the storage of VM storage object components on the local disks of the ESXi hosts, and it includes both the read caching and write buffering for these objects. When we talk in terms of components, we are talking about one of the striped components that make up a RAID-0 configuration, or one of the replicas that makes up a RAID-1 configuration. Therefore, LSOM works with the magnetic disks and solid-state disks (SSDs) on the ESXi hosts. To recap, the SSDs are used as a cache and a nonvolatile write buffer in front of the magnetic disks.

Another way of describing the LSOM is to state that it is responsible for providing persistence of storage for the VSAN cluster. By this, we mean that it stores the components that make up VM storage objects as well as any configuration information and the VM storage policy.

LSOM reports events for these devices, for example, if a device has become unhealthy. The LSOM is also responsible for retrying I/O if transient device errors occur.

LSOM also aids in the recovery of objects. On every ESXi host boot, LSOM performs an SSD log recovery. This entails a read of the entire log that ensures that the in-memory state is up to date and correct. This means that a reboot of an ESXi host that is participating in a VSAN cluster can take longer than an ESXi host that is not participating in a VSAN cluster.
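
If you want to see which device's log is actually being replayed on your hosts, the disks claimed by vSAN can be listed from the ESXi Shell. A minimal sketch (the field names are from memory and may differ slightly between versions):

esxcli vsan storage list

Each claimed disk is shown with its capacity and flags such as "Is SSD" and, on 6.x, "Is Capacity Tier"; the entry flagged as SSD but not as capacity tier is the cache device whose log is read on boot.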

TheBobkin
Champion

Hello,

Just to add to what vpradeep01 mentioned from that book:

This log data is not flushed to the capacity tier on host reboot, so the log recovery time (and thus the reboot time) correlates with the amount of log data on the cache device. My understanding is that this data also has to be reconciled with the current state of the data, because data keeps changing on the rest of the cluster while the node is absent. A high rate of change on the cluster therefore adds to the time needed for SSD initialization, as do unsupported changes such as increasing the default high and low log-capacity thresholds.
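
As a rough back-of-the-envelope illustration (the numbers are assumptions, not measurements): if roughly 25 GB of log data sits on the cache device and recovery reads it sequentially at around 100 MB/s, the scan alone takes about 4 minutes; double the amount of log data, or halve the effective rate because the host is also reconciling state with the rest of the cluster, and you are already in the 8-10+ minute range reported above.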


Bob