VMware Cloud Community
jkasalzuma
Enthusiast

Update VSAN configuration taking a long time

Has anyone else experienced the "Update VSAN Configuration" task that runs after enabling vSAN on the hosts taking a long time?

My tasks are running against 3 hosts and have taken 45 minutes so far and are still going. I don't remember it running this long in the beta. I am running the 5.5up1 bits on a fresh build (I wiped the disks before rebuilding), and this is a brand new 5.5up1 vCenter Appliance. There is nothing left of the beta.

It hasn't timed out after 45 minutes so I'm guessing that's a good thing.

I see no activity on the disks either, so I'm not sure if they are being zeroed or something along that line.

Any ideas? I'll update this post if something changes...

53 Replies
jkasalzuma
Enthusiast

So it appears the data in my Web Client was stale despite clicking refresh. The C# client shows the tasks timed out after exactly 30 min. Tried it again in manual mode (tried Auto prior) and the management agents stopped responding. Had to restart them via ssh.

I'm guessing something in my lab isn't right if it's not working correctly. But the odd thing is it's the same symptoms across all 3 hosts. Maybe it's vCenter?

jkasalzuma
Enthusiast

OK, I may have found the solution for my issue.

I was attempting to enable vSAN on an existing cluster with hosts that had load. In the beta I am pretty sure I was able to enable vSAN on an existing cluster containing hosts with load and NOT in maintenance mode, but maybe my memory fails me.

Either way, the solution to my issue was this...

1. Restart management agents on all 3 hosts in my cluster (hostd & vpxa) as they were not responding; ssh was responding, however.

2. Create a new cluster with DRS, HA and vSAN enabled.

3. Place one host into maintenance mode.

4. Move her to the new cluster; the cluster config worked just fine in under 30 sec.

5. Remove her from maintenance mode.

6. Move some VMs to the first host in the new cluster.

7. Put the second host into maintenance mode.

8. Move her to the new cluster.

9. Rinse, lather, repeat until all hosts and VMs are in the new vSAN-enabled cluster.
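
The rolling move in steps 2 to 9 above can be written out as an ordered checklist. This is a minimal Python sketch, not tied to any VMware API: the host names, the cluster name and the `rolling_move_plan` helper are all hypothetical, and it only prints the order of operations rather than performing them.

```python
def rolling_move_plan(hosts, new_cluster):
    """Return the ordered steps for moving hosts one at a time.

    Purely illustrative: this builds a checklist and does not talk to
    vCenter or to any host.
    """
    steps = [f"Create cluster '{new_cluster}' with DRS, HA and vSAN enabled"]
    for index, host in enumerate(hosts):
        steps.append(f"Put {host} into maintenance mode")
        steps.append(f"Move {host} to '{new_cluster}'")
        steps.append(f"Take {host} out of maintenance mode")
        # VMs still live on the hosts that have not been moved yet, so
        # free up the next host before it enters maintenance mode.
        if index + 1 < len(hosts):
            next_host = hosts[index + 1]
            steps.append(f"Migrate VMs from {next_host} to hosts already in '{new_cluster}'")
    return steps


if __name__ == "__main__":
    plan = rolling_move_plan(["esx01", "esx02", "esx03"], "vsan-cluster")
    for number, step in enumerate(plan, start=1):
        print(f"{number}. {step}")
```

Only one host is ever out of service at a time, which is why the VMs have to be shuffled onto already-moved hosts before the next host goes into maintenance mode.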

Now I'm getting an error on all host disk groups and they show "Unhealthy" with "Dead or Error" on all magnetic disks in the details.

The mag disks don't even show under any of the 3 hosts' storage devices. Weird stuff for sure...

I remember this being much easier in the beta for some reason...

SolidCactus
Contributor

Hi,

Thanks for this. I can confirm exactly the same symptoms from my side.

I have followed your instructions above and confirm I have gotten to the same scenario as yourself.


I'm getting an error on all host disk groups and they show "Unhealthy" with "Dead or Error".


I'm using the AHCI driver in my setup and this exact same setup in terms of hardware worked fine through the beta period. I'm using the vCSA.

Have you had any luck in getting your hosts to recognize your magnetic drives? I'm going to test a different number of magnetic drives and a different drive to see if I can get any other behaviour.

Not really sure what's going on here, as this worked very well in the beta, and since rolling the GA bits I have had nothing but issues in the few days since its release.

Any ideas would be greatly appreciated!

jkasalzuma
Enthusiast

Hey SolidCactus,

No, zero luck thus far. It's actually gotten worse.

I tried rebooting a single host to see if the HDD would come back, and now the host can't even be managed via ssh. It's pingable, but no management of any kind. My lab is at my office so I haven't been able to check the monitor for a PSOD.

So don't reboot your hosts!

I'm starting to follow the lead that I hit earlier, where things began to work when I had zero VM load on the hosts. I'm dusting off my old ML110G6 to move all VMs off these vSAN hosts (or hopefully future vSAN hosts) and trying again. I may rebuild the hosts yet again for a vanilla build (gparting all the disks too).

I would recommend, if your hosts are all communicating with vCenter, that you disable vSAN; hopefully you won't run into the same issues I did with hosts going management dark. I'd be interested to know whether you have any issues after disabling vSAN and rebooting a host.

FYI I am rolling a similar Lab as Erik Bussink but with i5's and no mSATA. I am using USB to boot from.

Homelab with vSphere 5.5 and VSAN | Erik Bussink

Jkasal

SolidCactus
Contributor

Hi Jkasal,

Thanks for the quick response! At the moment I have now disabled vSAN and I'm back to running over an iSCSI setup. I would really like to get vSAN running and test out the GA build.

When I first tried to set up vSAN on the GA bits I lost all connectivity to the hosts as you described. I then tried rebooting the hosts as I was unsure which management agents to reset. None of the hosts managed to boot successfully within an hour and a half. I hopped on the remote management of the hosts and they were all stuck on "usbarbitrator start".


I was unable to do anything with the hosts other than rebuild them again with the GA bits. I would be interested to know if you see the same issue once you are able to see your hosts again.


I thought I had messed up with the vSAN configuration as everything worked as expected in the beta setup. It's good to know that I'm not alone!

Looking at the link you pinged across it looks like you are running vSAN on the AHCI driver as well.


Is anyone else having luck with running vSAN GA build on the AHCI driver? If so, any tricks or tips?

Thanks,

SolidCactus!

jkasalzuma
Enthusiast

Ya, I'll take a look at what the screen shows a bit later today and let you know.

Out of curiosity, did you have any VMs running on your hosts when you enabled vSAN?

jkasalzuma
Enthusiast

So, looking at the monitor of my ESXi host that didn't come back up, it appears it never shut down completely.

It is stuck at "Shutting down VSAN IO layer...", "Running vsantraced stop".

She would not respond to any keyboard commands. Had to hard power her down.

I will be rebuilding my hosts and trying to enable vSAN all over again with no load at all on my hosts. I'll see where that gets me.

Update:

Prior to testing my luck with the vSAN setup again, I investigated what SolidCactus was talking about with AHCI.

I did a little digging and found this gentleman's article. VMware Front Experience: How to make your unsupported SATA AHCI Controller work with ESXi 5.5

After researching my AHCI controller using Mr. Peetz's command, I found I was using an "Intel Cougar Point 6 port SATA AHCI" controller. Class 0106: 8086:1c02.

I searched the ahci.map file referenced in his article and found my controller listed.

Not sure what that means but I hope it's a positive!
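
For anyone repeating that lookup, here is a rough Python sketch of the check. It assumes the map file is plain text that contains the `vendor:device` pair somewhere on a line; the sample text below is made up for illustration and is not copied from a real ahci.map, and `find_pci_id` is a hypothetical helper name.

```python
import re


def find_pci_id(map_text, pci_id):
    """Return the lines of map_text that mention pci_id (case-insensitive)."""
    # Match the ID only as a whole token, so that a shorter or longer hex
    # string that merely overlaps with it is not reported as a hit.
    pattern = re.compile(
        r"(?<![0-9a-f])" + re.escape(pci_id.lower()) + r"(?![0-9a-f])"
    )
    return [line for line in map_text.splitlines() if pattern.search(line.lower())]


if __name__ == "__main__":
    # Illustrative content only, not the real file format.
    sample = "id=8086:1c02 driver=ahci\nid=8086:1e02 driver=ahci\n"
    print(find_pci_id(sample, "8086:1c02"))
```

A hit only shows that the driver map knows about the controller ID; it says nothing about whether the controller is on the vSAN HCL.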

SolidCactus
Contributor

Thanks for the response.

My AHCI controller is supported out of the box in the GA build, so I don't think I really need to do anything further in terms of the article listed, but at least it shows that your controller is recognized and ready for use.

When creating the Disk Groups I only have one host with any virtual machines running on it. Unfortunately, this is exactly the same setup as I had in the beta builds, and it worked without error.

Any other ideas?

jkasalzuma
Enthusiast

Well last night I tried again. Here is the path I took and the conclusion. Hint: it didn't go well...

Start with freshly built ESXi 5.5 up1 hosts.

1. Created a cluster with the following enabled

  DRS - Full Auto

  HA - Admission Control Disabled

  vSAN - Manual

2. Checked all 3 hosts for required settings and networking

3. Placed all 3 hosts into maintenance mode

4. Added hosts to vSAN cluster one at a time waiting for the "Update VSAN Config" to fully complete.

5. Verified all 3 hosts still saw both their SSD and HDD.

6. Exited Maint Mode 1 at a time waiting for the process to complete before removing the next from Maint Mode

7. Double checked all settings once again. (Figured treating this like a rocket launch would help).

8. Before adding any disks to vSAN I reviewed the Cluster Props > vSAN > General page and it saw

  3 hosts,

  0 of 3 Eligible SSDs,

  0 of 3 Data disks,

  Total Cap 0.00B,

  Free Cap 0.00 B

  Network Normal

  Looks good!

9. Selected esx01, clicked "Create a new disk group"

10. Selected the one SSD and one HDD, clicked OK and waited for the "Create a New disk group" task to complete...

  Viewed the C# client and it stated "Initialize disks to be used by VSAN".

  I probably will put in a feature request for the Web Client to be more specific.

11. The task timed out after 30 minutes and the HDD had disappeared. There was a spike in traffic to the HDD but then it quickly died out. (See attached image)

I think the vCenter vpxd timeout might need to be increased. But still, I don't think it's going to solve anything.

  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101725...

It doesn't appear the GA vSAN is lab/enthusiast friendly at this point. If you are going to test for prod, you will probably need to pony up the cash for HW on the HCL.

Still I am going to export logs and open a case with VMware and see if they can lend a hand from purely a software POV.

This is obviously not 100% hardware related; there is probably a bug or misconfiguration somewhere. The fact that the host loses its disk (a fully functional disk and controller w/o vSAN enabled) and has issues during reboots ONLY when vSAN is enabled means there is more going on under the hood.

So for now it appears vSAN is a no go for me unless VMware support is willing to lend a hand on unsupported HW and has some ideas.

VSAN Disk Blip.JPG

Maybe someone reading this has other ideas.

Thanks, JKasal

SolidCactus
Contributor

Hi,

Wow, thanks for the update and the great detail you have supplied. Exactly the same scenario on my end, I'm afraid.

Please open the case and loop me in, as I'm happy to provide logs etc. to help get this resolved.

My AHCI driver is supported out of the box with 5.5U1, so that might lend a hand in getting this looked at?

Did the beta builds work for you at all? Do you know where the logs for vSAN are located?

Anyways let me know and happy to help out however possible!

jkasalzuma
Enthusiast

Cool I'll keep you posted on any findings.

I'm not sure of any logs for vSAN. I know you can start a trace, but I think (and I'm hoping I'll be corrected if wrong) that vSAN uses the host logs since it's really a host service. That would mean gathering logs through vCenter or the Support Assistant should gather the important stuff.

On a host though all logs are in the /var/log directory.

Edit:

After clicking submit, I did find a log called vsanvpd.log under /var/log
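
A quick way to spot vSAN-related logs is to filter the directory listing by name. The Python sketch below is only an illustration: it assumes a readable log directory (for example a copy of a host's /var/log pulled from a support bundle), and `find_vsan_logs` is a hypothetical helper name.

```python
import os


def find_vsan_logs(log_dir):
    """Return the sorted names of files in log_dir whose name contains 'vsan'."""
    return sorted(
        name
        for name in os.listdir(log_dir)
        # Compare in lower case so 'VSAN' and 'vsan' both match, and skip
        # subdirectories so only regular files are reported.
        if "vsan" in name.lower() and os.path.isfile(os.path.join(log_dir, name))
    )
```

Calling `find_vsan_logs("/var/log")` against such a directory should list vsanvpd.log alongside any other file with "vsan" in its name.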

SolidCactus
Contributor

Ok great. If you need any help with the case please let me know what you need and I will be glad to help out.

In the meantime, if anyone else has this issue or any ideas on troubleshooting this please chime in!

Also, are you using a Windows based vCenter or the vCSA?

jkasalzuma
Enthusiast

vCSA

SolidCactus
Contributor

Ok thanks for letting me know. I might roll a Windows based vCenter and see if the behavior is any different. I can't imagine it will be but still worth a shot to try and get it working!

depping
Leadership

SolidCactus wrote:

Thanks for the response.

My AHCI controller is supported out of the box in the GA build, so I don't think I really need to do anything further in terms of the article listed, but at least it shows that your controller is recognized and ready for use.

When creating the Disk Groups I only have one host with any virtual machines running on it. Unfortunately, this is exactly the same setup as I had in the beta builds, and it worked without error.

Any other ideas?

Not sure where you see the support statement for AHCI? It is not on the HCL any longer. I have AHCI in my lab as well and am experiencing similar problems.

Duncan

-----------

Book out soon: Essential Virtual SAN: Administrator's Guide to VMware VSAN (VMware Press Technology)

SolidCactus
Contributor

Hi Duncan,

Thanks for pointing that out to me. Did they strip out the AHCI support from the GA builds then? My apologies I didn't realize it has been removed.

I will start trying to track down some supported controller cards. Any other ideas on where to go with this? It's good to hear you are experiencing similar sort of issues on the same setup.

BTW thanks for all your posts on Yellow Bricks about vSAN. They have been very informative.

depping
Leadership

SolidCactus wrote:

Hi Duncan,

Thanks for pointing that out to me. Did they strip out the AHCI support from the GA builds then? My apologies I didn't realize it has been removed.

I will start trying to track down some supported controller cards. Any other ideas on where to go with this? It's good to hear you are experiencing similar sort of issues on the same setup.

BTW thanks for all your posts on Yellow Bricks about vSAN. They have been very informative.

The AHCI controller was briefly listed on the VSAN HCL but has been removed as far as I can tell. I have heard of at least 3 others experiencing similar behavior with this controller, and I personally recommend avoiding it for this reason. I was told, though, that these issues are being investigated and hopefully will be solved over time. Unfortunately, I do not know when. Maybe kmadnani can say more about support for the AHCI controller, and I can imagine he would be interested in log files etc.

Duncan

admin
Immortal

Yes - we are looking into an issue we ran into with the AHCI controller/driver. We will be releasing a fix for this and supporting this controller as soon as possible. In the meantime, if you can provide the log files to us, that would be helpful.

Thanks.
Kiran

jkasalzuma
Enthusiast

I was unable to gather logs. I had disabled vSAN, and as soon as I initiated the log gathering, two of the hosts went unresponsive with PSODs.

I will follow the procedure from my earlier post again, gather logs, and open a case to attach them to.

Below are the two PSODs

esx21 PSOD.jpg

esx03 PSOD.jpg
