VMware Cloud Community
ddastoor
Contributor

VSAN troubles - Performance degradation, erroneous traces, absence of garbage collection (depending on how you remove VMs)

Hello all! VMware customer, enthusiast and virtualization lover here.

I'm posting here to shed some light on issues we have been facing since we began our VSAN initiative, in the hope that VMware releases something to its users warning them about the potential time bomb awaiting them.

We started this project back in December 2014 and put it into production in early January 2015. We used VSAN for our entire VDI environment (Citrix VDI-In-a-Box - VIAB) as well as a few of our higher requirement servers, and everything worked fine for the first couple of weeks. Please note that we had followed all available documentation, tutorials, and industry suggestions. We researched common problems and made sure we didn't fall into the same pits as others.

In late January/early February we started to get some very strange alerts about "vKernel VM Disk Latency" and "Virtual machine total disk latency". These alerts would arise sporadically, last for about an hour, and then magically resolve themselves. During this time, our entire VSAN infrastructure would come to a grinding halt. ESXi hosts would disconnect themselves from vCenter (even though our vCenter was not on VSAN) and everyone's desktops would freeze, at which point VIAB would start panicking and "deleting" perfectly fine desktops, and our organization would basically stop working. We had no control over it and had to sit and wait for it to do its thing.

After diagnosing the issue using the VSAN Observer, we could see heavy congestion on specific VSAN hosts as they were trying to balance the cluster. But with no explainable pattern, VMware tech support had no answer except "increase your capacity", even though we were well below the recommended limit and had more than enough resources (flash to spinning ratio) available to satisfy the load. We had also noticed that the size of our data was growing quite fast compared to what we had seen in the past; however, at that time we had quite a few VMs using different fault tolerance levels and simply thought it was related to that.

After months of VMware tech support struggling with unrelated issues (such as the vsantraced service not being able to start), they were ready to give up. That's when we decided to blow up the current VSAN and start a brand new one from scratch, except this time only placing VIAB on it. It quickly became apparent that this growth in usage was actually a problem, possibly even THE problem. We noticed the usage increasing very fast while sustaining the same number of VMs. It seemed that every time we created a new golden image ("deleting" old VMs and creating new ones, but keeping the total number of VMs the same), the usage would drastically increase instead of staying roughly level.

After some more digging and testing, we were able to understand what was going on a little better. Apparently VSAN does not delete VM data unless the VM (and all of its disks) is deleted using "All vCenter Actions" > "Delete from Disk" or whatever the comparable new API call to this function is.

del-from-disk.jpg

If you or any service (like VIAB) simply use "All vCenter Actions" > "Remove from Inventory" and then "Delete File" from the DS browser, or a comparable API call, the VM folder (namespace) will disappear from the VSAN datastore; however, all of its data will continue to reside in your infrastructure without any reference/pointer and will not be cleaned up by the CMMDS.

del-from-ds.jpg

You can verify this yourself: find the UUID of the VM (confirm the friendly name in the cmmds-tool output to make sure), "Remove from Inventory", delete the VM folder from the datastore, and then run cmmds-tool and objtool on the UUID to see all of the details still lingering in VSAN's inventory. Of course, since you have this UUID on hand you can delete these objects using objtool; however, this is an extremely manual process and requires you to know the VSAN UUID of each VM that is swimming in the void and needs to be deleted.
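As a rough illustration of the leak described above, here is a toy model in Python. It is not how VSAN is implemented and it calls no VMware API; the class and method names (ToyObjectStore, delete_folder_only, and so on) are made up for this sketch. It only shows why removing the folder that holds the descriptors leaves the backing objects consuming space.

```python
# Toy model only: NOT how vSAN is implemented and uses no VMware API.
# A namespace (VM folder) holds descriptor files that point at objects by UUID.
import uuid


class ToyObjectStore:
    def __init__(self):
        self.objects = {}      # object UUID -> size in GB
        self.namespaces = {}   # folder name -> {descriptor file name: object UUID}

    def create_vm(self, name, disk_sizes_gb):
        """Create a folder plus one backing object per virtual disk."""
        descriptors = {}
        for index, size in enumerate(disk_sizes_gb):
            object_id = str(uuid.uuid4())
            self.objects[object_id] = size
            descriptors[f"{name}_{index}.vmdk"] = object_id
        self.namespaces[name] = descriptors

    def delete_from_disk(self, name):
        """Supported path: free every referenced object, then drop the folder."""
        for object_id in self.namespaces.pop(name).values():
            del self.objects[object_id]

    def delete_folder_only(self, name):
        """Datastore-browser style delete: descriptors vanish, objects stay."""
        del self.namespaces[name]

    def unreferenced_objects(self):
        """UUIDs that no descriptor points at any more."""
        referenced = {
            object_id
            for descriptors in self.namespaces.values()
            for object_id in descriptors.values()
        }
        return sorted(set(self.objects) - referenced)

    def used_gb(self):
        return sum(self.objects.values())


store = ToyObjectStore()
store.create_vm("desktop-a", [40, 10])
store.create_vm("desktop-b", [40, 10])

store.delete_from_disk("desktop-a")    # frees 50 GB
store.delete_folder_only("desktop-b")  # frees nothing

print(store.used_gb())                    # 50 GB still consumed
print(len(store.unreferenced_objects()))  # 2 objects with no descriptor left
```

In the toy model, as on the real datastore, the only way to reclaim the space afterwards is to already know the UUIDs of the unreferenced objects.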

Today after an exchange with VMware tech support, we have concluded that we will not be using VSAN any longer and have started looking into alternate solutions because of the way VMware handled this issue and their non-existent plan for fixing a major problem.

Hopefully this post sheds some light on issues some of you may be having to save you months of wasted time and potentially even prevent some of you from losing your jobs. If you or an application you're using is deleting objects from the VSAN datastore directly, STOP RIGHT NOW!

tl;dr - We set up VSAN to VMware recommended spec, and after a few weeks we got extreme latency and hundreds of alerts for hours on end that would bring down the entire VSAN infrastructure. Lack of garbage collection on said "legacy" features caused this.

VMware problem to fix: Deleting unregistered VMs from a VSAN datastore does not actually delete the data. The data is left to roam freely within the DS without any reference, taking up space and resources, and eventually causes extreme performance issues when VSAN tries to balance it.

VMware response to problem: "Deleting a file through the datastore browser, while not disallowed, is not supported in regards to VSAN due to the handling we have already discussed. There are currently no plans to change this behavior."

9 Replies
zdickinson
Expert

I would be very interested to hear Cormac and Duncan reply to this.  We have vSAN in DR and use Zerto to replicate VMs.  It also creates and deletes files on the vSAN datastore, and it sounds similar to your Citrix solution.

Does your Citrix VIAB specifically support vSAN?  For instance, we had to wait until our replication software (Zerto) supported vSAN.

Did you experience this with vSphere 5.5, 6, or both?

Thank you, Zach.

ddastoor
Contributor

Hey Zach,

From the behavior I have seen, I would say that VIAB does not support VSAN.

The latest release of VIAB is quite old and Citrix really isn't talking/releasing much news about this product - they're being very hush-hush about anything they say regarding VIAB. I have an escalated ticket with Citrix that has been pending a response for two weeks, and I'm in the process of contacting management about it, so hopefully I'll have some concrete results soon.

We experienced all of our issues with vSphere 5.5.

I'm curious to know if the filesystem changes coming in VSAN 6 will make any difference in this issue; however, the response I got from the tech regarding this case doesn't make me too hopeful.

I have also submitted a "feature" request regarding this problem. I hope the engineers/devs understand how serious of an issue this is.

cdekter
VMware Employee

I have been working with the VMware support team on this case. The original poster is already familiar with me but for the benefit of other readers, I am the Customer Advocacy manager for VSAN.

Thank you for taking the time to post about your experiences using VSAN. It is clear from your posting that your support experience has been less than satisfactory – for this, please accept my sincerest apologies. I have completed a further review of your case with the assistance of the VSAN engineering team. Unfortunately, the Citrix VIAB software is employing a legacy file API that VMware advises against using. This has caused the undesirable behavior you are seeing: leaked objects. The correct API to use for creating and deleting VMs and their associated disks is the VirtualMachine vSphere API. This is not a new API for VSAN, but rather has been the standard way of performing operations on virtual machines for some time now. So, it would be fair to say that the legacy software that has not been updated to interact correctly has caused the problem in this specific case.
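For readers wondering what "use the VirtualMachine API" looks like in practice, here is a minimal Python sketch. It assumes `vm` is a pyVmomi `vim.VirtualMachine` object obtained elsewhere, and `wait_for_task` is a hypothetical helper supplied by the caller that blocks until a task finishes. `Destroy_Task`, `PowerOffVM_Task` and `UnregisterVM` are method names from the vSphere API; everything else here is illustrative and untested against VSAN.

```python
# Sketch only. `vm` is assumed to be a pyVmomi vim.VirtualMachine obtained
# elsewhere; `wait_for_task` is a hypothetical helper supplied by the caller.

def delete_vm_from_disk(vm, wait_for_task):
    """Power off if needed, then destroy the VM and its disks in one API call."""
    if vm.runtime.powerState == "poweredOn":
        wait_for_task(vm.PowerOffVM_Task())
    # Destroy_Task removes the VM from inventory AND deletes its files,
    # unlike UnregisterVM(), which only removes the inventory entry.
    wait_for_task(vm.Destroy_Task())
```

The point of the sketch is the single destroy call: software that unregisters the VM and then deletes its files separately is following the pattern this thread identifies as the source of leaked objects.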

With regard to the issue of manually deleting files from the datastore browser, this is without doubt currently a shortcoming. We intend to improve how this feature interoperates with VSAN, either by extending it to work correctly or by disabling certain functions that are not supported on VSAN. In the meantime, we will publish a KB article warning customers not to use this specific method to delete files on VSAN.


If compatibility with Citrix VIAB is critical to your application, it is possible that VSAN may not be the right storage product in this case. We apologize for this, and hope that you would consider VSAN for other applications in the future.

Regards,

Chris Dekter

zdickinson
Expert

Chris, as a new vSAN user, I am very interested in this issue.  I understand that VIAB using a legacy API was the root cause, and that this shouldn't be an issue with newer software that uses the current API.  I also understand that you should not remove a VM from inventory and then delete the files, but instead use Delete from Disk in the VIC or Web Client.

However, what about raw files you've placed on a vSAN datastore?  Specifically ISO files for OS installs.  I upload those via WinSCP and then use WinSCP to delete them when not needed.  From my understanding these ISO files would be orphaned and still taking up space on the vSAN datastore.  Is that correct?  Is there a way to do "clean up"?

Does this behavior persist in vSphere 6?

Thank you, Zach.

ddastoor
Contributor

Hey Zach,

I believe VSAN does things quite differently when it comes to files and you are not supposed to use it as a traditional DS by placing "normal" files on it.

From what I've soaked in over the past few months (of VSAN in 5.5), VSAN deals in VM objects and everything is designed to work around only this. It should not be mistaken for a common NTFS file system onto which you can place files. All of the load balancing and high availability mechanisms require everything on VSAN to be VM objects and not other file types (ISO, ZIP etc). For this reason I think you should probably remove any such files from VSAN else you may run into problems in the future.

I'm hoping that this "new file system" in VSAN6 removes this restriction because currently you would need some secondary storage mechanism like extra local disks or NAS to save files that are not VM objects, making VSAN quite a constrained/narrow product (quite far from VMware's vision of virtualizing the storage platform).

But to answer your question, I don't think what you are doing would make the data orphaned, as the mechanisms of VSAN don't know of the existence of this data (it was placed by a file transfer outside of vSphere). I'm not sure if VSAN would still try to load balance it or ignore it altogether, but I wouldn't put any more "normal" files on it until someone here answers your question with more certainty.

ramakrishnak
VMware Employee

> But to answer your question, I don't think what you are doing would make the data orphaned, as the mechanisms of VSAN don't know of the existence of this data (it was placed by a file transfer outside of vSphere). I'm not sure if VSAN would still try to load balance it or ignore it altogether, but I wouldn't put any more "normal" files on it until someone here answers your question with more certainty.


Yes, correct.

VSAN is an object store. As of today, the VM namespace (the VM directory and the files residing under it), VM swap files, individual virtual disks, and VM snapshots are objects.

The rest of the files are like any other files in a traditional store. VSAN does not meddle with them, nor does it enforce load balancing or VSAN policies on them.

The only side effect I can think of is storage space consumption.

Also note that these files are not orphaned. If a user deletes an ISO, it is removed just as on any traditional filesystem today.

To answer the original query:

Unfortunately, what ddastoor noticed is a known issue: deleting a VSAN directory through the file browser without first cleaning up the disk objects can cause problems. Unfortunately, the fix didn't make it into the product.

The fix was to either prevent users from deleting a VSAN directory that is not empty, or to clean up the objects before the directory is removed.


When a user deletes the directory via the file browser, or issues unsupported commands like rm -rf, the objects are healthy as far as VSAN is concerned, but the user has no way to reach them because the object descriptor was removed.

Until these issues are addressed, there are some best practices and workarounds to prevent or clean up such objects:

- When using the datastore file browser to clean up, first delete the individual files from the directory.

- Never use traditional storage commands like rm -rf to clean up VSAN objects.

- Use command-line variants like /usr/lib/vmware/osfs-rmdir to remove stale directories. It will refuse to delete the directory if it detects objects/files in it.

- There are special internal commands like objtool/cmmds-tool that can remove such objects, after you have properly identified the disks you want to remove.
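The "files first, then the directory" rule from the list above can be sketched as a small Python guard. This is only an illustration: the tool path is copied from this thread and should be checked on your own ESXi build, passing the directory as the single argument is an assumption, and the function name plan_directory_removal is made up. The function never runs anything; it only returns the command it would run.

```python
import os

# Path as quoted in this thread; verify it on your own ESXi build.
OSFS_RMDIR = "/usr/lib/vmware/osfs-rmdir"


def plan_directory_removal(directory):
    """Return the command to remove `directory`, or raise if it still has entries."""
    leftovers = sorted(os.listdir(directory))
    if leftovers:
        raise RuntimeError(
            f"{directory} still contains {len(leftovers)} entries; "
            "delete the individual files first"
        )
    # Assumption: the tool takes the directory path as its single argument.
    return [OSFS_RMDIR, directory]
```

Refusing to proceed while the directory still has entries mirrors the behavior described for osfs-rmdir, so a script wrapping it fails early instead of leaving objects behind.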

Cormac's troubleshooting guide covers this in detail.

Thanks,

cdekter
VMware Employee

ramakrishnak has covered the key topics, but I would like to add one more answer regarding the question about the new on-disk format for VSAN 6.0:

ddastoor wrote:

I'm hoping that this "new file system" in VSAN6 removes this restriction because currently you would need some secondary storage mechanism like extra local disks or NAS to save files that are not VM objects, making VSAN quite a constrained/narrow product (quite far from VMware's vision of virtualizing the storage platform).

There is no change to the behavior when placing non-virtual machine files on the VSAN datastore between VSAN 5.5 and 6.0. You are able to create folders in the VSAN datastore and place files inside these folders. The folders will each be backed by a VSAN object and the storage policy applied to this object will be the default one configured for the cluster. Later if you wanted to remove this directory, you would need to use the osfs-rmdir command to ensure that the object is not left behind.

zdickinson
Expert

Do I understand this correctly?  If I have an ISO folder that contains .iso files and I go into the folder and delete the files and then the folder, they will be removed.  However, if I delete the folder itself before removing the .iso files, the data will remain but will not be visible to me.

Is there any documentation explaining this?  I feel like I've read it all, including Cormac and Duncan's book, Essential Virtual SAN, and this is the first I am hearing of this.  Is there a way to request documentation?

Thank you, Zach.

cdekter
VMware Employee

That is more or less correct, Zach. If you delete the files, the space they consumed will be released. If you delete the folder without deleting the files first, the space will not be released - you would then need to delete the object manually.

I don't believe we currently have a good document for this scenario. As I mentioned in my first post, we will be writing up a knowledge base article detailing the issues and processes around deleting files manually from the VSAN datastore.
