VMware Cloud Community
jmelchio
Contributor

vsan.resync_dashboard

I am currently running a proof of concept instance of Virtual SAN in a test environment (preparing to move into a dev/qa rollout).

I was testing and documenting disk failure/replacement procedures. I noticed that after testing an SSD failure/replacement, the vsan.resync_dashboard command showed no objects left to sync. This seemed odd to me, as I am running only a three node cluster, and replacing an SSD should eliminate the entire disk group.

I checked the storage policy compliance and all items were compliant, though each object did show an absent component (the disks were pulled rather than failed, so it triggered the wait period). As I looked at each VM in the storage policy, I noticed that all the VM objects on that disk group were actually witness files. Eventually all the VMs in the datastore were synced up and showed fully healthy.

I began researching the issue and came across the article below that indicates the vsan.resync_dashboard command only reports VMs resyncing and not templates. Does this also apply to the witness file component of a VM? If so, is there any plan to expand the resync_dashboard command?

A secondary question, is it normal for vsan to dump all my witness files onto the same host in a 3 node cluster?

vsan.resync_dashboard only reports VM resyncing, not templates | CormacHogan.com

10 Replies
ramakrishnak
VMware Employee

> Does this also apply to the witness file component of a VM? If so, is there any plan to expand the resync_dashboard command?

The witness component is a zero-byte object, so I would assume it will not show up in the calculation for "Bytes to Sync".


I am not sure whether resync_dashboard accounts for the witness component; I will check and get back to you.

> A secondary question, is it normal for vsan to dump all my witness files onto the same host in a 3 node cluster?

This can happen depending on the kind of setup you have, but we try to distribute the objects/components equally among the hosts in the cluster, unless you have an imbalanced cluster with respect to disk groups.

You can run vsan.disks_stats to check the component distribution across nodes and disks.

ramakrishnak
VMware Employee

> I was testing and documenting disk failure/replacement procedures. I noticed that after testing an SSD failure/replacement, the vsan.resync_dashboard command showed no objects left to sync. This seemed odd to me, as I am running only a three node cluster, and replacing an SSD should eliminate the entire disk group.

The default VSAN resync delay for absent components is 60 minutes,

i.e. the resync will trigger only after 60 minutes.

vsan.resync_dashboard will show the details only after this time.

Can you check if this is the case?
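To illustrate the timing described above, here is a minimal Python sketch. It is not VSAN code; the function name and the state strings are hypothetical, and it assumes that DEGRADED components are rebuilt immediately while ABSENT components wait out the 60 minute default mentioned in this thread.

```python
from datetime import timedelta

# Assumption: default delay for ABSENT components, as quoted in this thread.
REPAIR_DELAY = timedelta(minutes=60)


def rebuild_due(state: str, time_in_state: timedelta,
                delay: timedelta = REPAIR_DELAY) -> bool:
    """Return True once a rebuild would be expected to start.

    Hypothetical helper for illustration only:
    - DEGRADED: rebuild is expected right away.
    - ABSENT:   rebuild is expected only after the delay has expired.
    - anything else: no rebuild is expected.
    """
    state = state.upper()
    if state == "DEGRADED":
        return True
    if state == "ABSENT":
        return time_in_state >= delay
    return False


if __name__ == "__main__":
    # A pulled disk (ABSENT) checked after 30 minutes and after 2 hours.
    print(rebuild_due("ABSENT", timedelta(minutes=30)))
    print(rebuild_due("ABSENT", timedelta(hours=2)))
```

The sketch only models the timer. Other conditions, such as available resources in the cluster, can still prevent a rebuild from starting.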

CHogan
VMware Employee

It might be worthwhile reading the new Troubleshooting Reference Manual, especially the section about replacing drives, and what to expect. The "Tips for a successful VSAN evaluation", written for 5.5, also provides steps on how to evaluate a drive failure.

As Rama pointed out, it is critical to understand the difference between ABSENT & DEGRADED components, and what VSAN does when these conditions occur.

All of the docs mentioned here can be found on the VSAN Resource page - VMware Virtual SAN Technical Resources | VMware Australia

HTH

Cormac

http://cormachogan.com
jmelchio
Contributor

Thanks for the quick replies.

I will check out the troubleshooting reference manual. I have read the evaluation guide (great document btw, it has helped me a ton in my evaluation) so I have a good handle on the differences between the absent and degraded state.

In this case, I did wait the 60 minutes for the absent timer to expire (I waited closer to two hours). The host I was testing was in maintenance mode (to remove running VMs). It didn't start the rebuild process until I removed the host from maintenance mode. At that point I could see the various VM objects go from absent to healthy. However, the resync_dashboard command did not show any syncing items. I ran the vsan.disks_stats command and in the Num Comp column I could see the number of components increasing on disks in that host. As I said in my original post, the only items that were on that host were Witness objects so it got me thinking that the resync dashboard only shows VMDK objects.

ramakrishnak
VMware Employee

> were on that host were Witness objects so it got me thinking that the resync dashboard only shows VMDK objects.

Yes, this is correct.

Witness objects are zero byte objects.
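As a rough illustration of why a zero-byte witness would not contribute to a "Bytes to Sync" figure, here is a small Python sketch. The `Component` class, the `resync_summary` function, and the rule that only components with bytes left to sync are listed are all assumptions made for the example, not the actual implementation of vsan.resync_dashboard.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Component:
    kind: str           # illustrative label, e.g. "vmdk" or "witness"
    bytes_to_sync: int  # a witness is treated as zero bytes here


def resync_summary(components: List[Component]) -> Tuple[int, List[Component]]:
    """Return (total bytes to sync, components that would be listed).

    Assumption: only components with bytes left to sync are listed.
    """
    total = sum(c.bytes_to_sync for c in components)
    listed = [c for c in components if c.bytes_to_sync > 0]
    return total, listed


if __name__ == "__main__":
    # A disk group that held only witnesses: nothing to report.
    total, listed = resync_summary([Component("witness", 0),
                                    Component("witness", 0)])
    print(total, [c.kind for c in listed])
```

Under that assumption, a host that held only witness components would rebuild them without anything appearing in the dashboard, which matches what was observed above.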

Thanks,

ChrisKuhns
Enthusiast

Cormac,

What if none of the triggers are being hit to begin the rebuilds? I have had objects marked ABSENT and DEGRADED for days, and no rebuilds have taken place against those objects. I check the resync as well as chance all of the I have put in a ticket and not received anything back from VMware. Now my Exchange Archive Server has become unable to boot and I have drives that are now going to be permanently corrupted. I don't get it, because all disks are physically healthy, and all RVC commands for checking health come back healthy.

I own your book, but I can't seem to find something that allows me to target objects and delete them outright.         

CHogan
VMware Employee

What is the nature of the failure in the cluster?

Do you have enough available resources left in the cluster to accommodate a rebuild?

For example, if you have a 3 node cluster and one node fails, there are no resources left on which to rebuild the components.

Another example, if you are running at near full capacity, and you have a disk failure, you may not have enough storage space to rebuild the components.

These are some of the reasons why a rebuild might not initiate when you have absent/degraded components.
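The two checks above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical, and it assumes the usual rule that tolerating n failures needs 2n + 1 hosts, so a policy that tolerates one failure needs three healthy hosts.

```python
from typing import List


def rebuild_blockers(total_hosts: int, failed_hosts: int,
                     free_tb: float, needed_tb: float,
                     ftt: int = 1) -> List[str]:
    """Return the reasons a rebuild could not start (empty list if none).

    Hypothetical helper: checks host count and free capacity only.
    """
    blockers = []
    healthy = total_hosts - failed_hosts
    required = 2 * ftt + 1  # assumption: 2n + 1 hosts to tolerate n failures
    if healthy < required:
        blockers.append(
            f"only {healthy} healthy hosts, {required} required for FTT={ftt}")
    if free_tb < needed_tb:
        blockers.append(
            f"only {free_tb} TB free, {needed_tb} TB needed for the rebuild")
    return blockers


if __name__ == "__main__":
    # 3 node cluster with one node down: no host left to rebuild onto.
    print(rebuild_blockers(total_hosts=3, failed_hosts=1,
                           free_tb=10.0, needed_tb=1.0))
    # Nearly full cluster with a disk failure: not enough space.
    print(rebuild_blockers(total_hosts=4, failed_hosts=0,
                           free_tb=0.5, needed_tb=2.0))
```

An empty result does not guarantee that a rebuild will start; it only rules out these two causes.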

HTH

Cormac

http://cormachogan.com
ChrisKuhns
Enthusiast

Originally I received a permanent disk failure on Wednesday of this past week, and I couldn't determine what the issue was. It literally looked as though there was nothing wrong. I ran through troubleshooting and looked at the physical disks, but there wasn't anything I could determine to be wrong. I tried to evacuate the host for a day and it crashed. I did a reboot to get into the Dell services. It came back up and was working just fine. It was as though nothing had happened.

I have a four node cluster. Of the 72.76 TB, 40 TB is free, so there was more than enough room for a rebuild.

Dell R720XD

CPU: Node 1-3 Dual Socket Intel Xeon E5-2630 V2 / Node 4 Dual Socket Intel Xeon E5-2650 V2

MEMORY: 128 GB  (8 GB RDIMM)

NICS: Intel X520 DP 10Gb DA/SFP+, + I350 DP 1Gb Ethernet

DISKS: 4 - 960 GB SSD / 5 - 4 TB SATA II 7200 rpm

Also, the things that are syncing are nuts. It showed one VMDK resyncing with 27196 GB left to go. The entire VM itself is only 5.8 TB. Multiple machines have VMDKs with components that are not rebuilding. It's been well over 60 minutes for anything that is marked absent, and the DEGRADED items are just sitting there.

One thing to note is that the person that originally built the disks, didn't present them as individuals, he created one large virtual disk for each host. I was planning on rectifying this on my upgrade to vSphere / VSAN 6.0 but I have four components that are inaccessible.

Thanks for anything you can provide here. Both you and Duncan have provided a great deal of information. This was our prototype box. I have great faith in the VSAN to change the direction our Datacenter for my school district goes this summer when we vote for funds. A 9 node VSAN with full disks is on the block. I don't want data loss to sour that deal.

CHogan
VMware Employee

Chris,

I don't want to jump to any conclusions, but this is worrying ...

> One thing to note is that the person that originally built the disks, didn't present them as individuals, he created one large virtual disk for each host. I was planning on rectifying this on my upgrade to vSphere / VSAN 6.0 but I have four components that are inaccessible.

If this is in a RAID0 configuration, and you have a disk failure, this will take out all of the storage on that host. Therefore every component on that host would need to be rebuilt.

This could be why many components are not syncing - but I am jumping to conclusions.

I would continue to work this through the service request with GSS, and hopefully they can set you right.

I would also try to rectify this disk configuration asap - if it is indeed the root cause, you do not want it to happen again.

HTH

Cormac

http://cormachogan.com
ChrisKuhns
Enthusiast

Thanks, Cormac. I got on it last night like gangbusters and started to break up and wipe the disks and start over on the configurations. As for GSS, ha, we don't have that kind of cash flow! ;) Normally, our basic support does the job, when they contact you. I put the ticket in a week ago and haven't heard back. As always, you've been a huge help and it's appreciated.
