Hello VSAN Enthusiasts,
We are currently encountering an odd issue with ESXi 6 and Virtual SAN 6. We have no virtual machines or data running on the vsanDatastore, yet when I pull a space utilization chart from the vCenter Web Client it shows 10 TB of “Other” files in use and 1.7 TB of Virtual Disk in use.
Furthermore, when I drop into RVC and run vsan.disks_stats, it shows used space on every drive anywhere from 65% to 78%. When I run vsan.check_state I get “Did not find VMs for which VC/hostd/vmx are out of sync”. When I “ls” the vsanDatastore it shows 65.6% used. When I run vsan.resync_dashboard it shows 0 to resync. Another odd symptom: when I browse the vsanDatastore with WinSCP, the only things showing are .sff.sd folders inside the folders named for the removed VMs that used to be on the datastore, with nothing in them, and it takes quite a while for WinSCP to confirm that those folders are empty.
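For anyone wanting to reproduce these checks, here is a sketch of the RVC session I used. The vCenter address and inventory path are placeholders; adjust them to your own environment:

```shell
# Connect RVC to vCenter (RVC ships with the vCenter Server Appliance)
rvc administrator@vsphere.local@vcenter.example.com

# Navigate to the VSAN cluster (example path; yours will differ)
cd /vcenter.example.com/Datacenter/computers/VSAN-Cluster

# Per-disk capacity and usage on every host in the cluster
vsan.disks_stats .

# Check whether VC/hostd/vmx views of any VM are out of sync
vsan.check_state .

# Show whether any objects are currently resyncing
vsan.resync_dashboard .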
We have resorted to running all of our VMs from an NFS datastore on a PowerVault NX3100 until VMware can come up with a solution, but we are still using the compute resources of the cluster.
Cluster Information (All hardware is on the HCL and latest firmware from Dell applied):
(4) Dell PowerEdge R730 (4 node cluster)
My question: there seems to be an issue with VSAN removing data from the vsanDatastore, and I am wondering whether anyone has had similar experiences with similar hardware, or has any advice or suggestions. We have had a case open with VMware for three weeks and counting; since we have production support they are trying to work the issue on their end, but they have yet to determine the cause. We have uploaded logs and provided as much information as they need, but I am stumped as to where to go from here, as it seems this is taking VMware quite some time to figure out. Any input or guidance is greatly appreciated.
Can you let me know the SR number of your case? I will dig up the details. Let's get some more info on what's stored on your VSAN. First, run "vsan.obj_status_report"; it should give you an idea how many objects we are talking about and their health (it may say "1/3", "3/3", or something like that for their health). Then run the command again, but with --filter-table 3/3 (for example, pick a big group) and with --print-uuids. This will give us an overview of the stored UUIDs. You can then use "vsan.object_info" to learn more about these objects, which may tell us what they used to be (VMDKs, namespaces, swap, etc.).
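To make the workflow above concrete, here is roughly what the sequence looks like inside RVC (assuming you have already cd'd into the cluster directory, so "." refers to the cluster):

```shell
# Summarize all VSAN objects in the cluster, grouped by health
vsan.obj_status_report .

# Re-run, filtered to one health bucket (e.g. the 3/3 group),
# and print the UUIDs of the objects in that bucket
vsan.obj_status_report . --filter-table 3/3 --print-uuids
```

The UUIDs from the second run are what you then feed to vsan.object_info to identify what each object used to be.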
Btw, if you just want to reset the VSAN datastore and get back to empty disks, I can help you with that, but it's likely best to spend a few cycles getting to know the issue better first.
Thanks for taking the time to jump in.
Here is the output of vsan.obj_status_report:
Here is the output of vsan.obj_status_report --filter-table 3/3:
At this point I think resetting VSAN is the only thing left, but I want to make sure that this will not affect the VMs on the NFS datastore. I tried to PM you my SR number but could not find where to do that. If you can PM me, I can reply back with my SR#.
OK, so we know that all the objects you have are healthy from a VSAN perspective. How did you delete your VMs? Did you go to the datastore browser, select the directories, and delete them? Or which workflow did you use?
As a next step, run "vsan.object_info &lt;cluster&gt; &lt;objUuid&gt;" against some of the UUIDs shown in the second screenshot. Do this for maybe 5 to 10 objects. It should tell us more about the identity of these objects, such as which .vmdk file path they used to be represented by.
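For example, from the cluster directory in RVC (the UUIDs below are placeholders; substitute a handful of the real ones printed by the filtered report):

```shell
# Inspect individual objects; the output includes the object's path
# (e.g. a .vmdk or VM namespace) and its component layout
vsan.object_info . 5c9b1a55-1234-abcd-ef00-0050560001aa
vsan.object_info . 5c9b1a55-5678-abcd-ef00-0050560001bb
```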
I think some other people are going in the same direction: depending on how files are deleted, they may actually not be deleted. This can happen if you remove a VM from inventory and then delete its files from the datastore browser, or if a solution is using an outdated API to delete files. I don't believe there is an automated tool for cleanup. I believe it is possible to do it manually, though I don't know the exact procedure. Thank you, Zach.
Sorry for the AFK, long days here. We came in yesterday to find one of our main DB VMs completely offline and had to perform a restore from backup. Needless to say, yesterday was Monday 2.0 and people were not happy, as they had to recreate an entire day's worth of work because, for some reason, Veeam and VSAN did not play well together during backup, despite Veeam having backed up our VMs for three weeks without problems. After speaking with another VMware tech, he indicated that the issue was due to an "underlying" issue within VSAN and that he would let the lead tech on the case know what else was going on.

After a phone call with VMware management I was able to get a very good engineer. He sat with us, confirmed that all of our VMs had been moved from the vsanDatastore to our NFS datastore, and proceeded to destroy the disk groups and vsanDatastore in preparation for a rebuild. Unfortunately, despite what vCenter and RVC were telling us, production VMs/VMDKs were still running inside the vsanDatastore. He got about 75% of the way through and stopped once we saw our VMs going offline. Working with another really smart VMware engineer, we were able to rebuild some of our data and had to restore the rest from backup to our NFS datastore.
We tested VSAN in our lab prior to going to production, we researched many of the pitfalls that others had encountered with 5.5/6, and made sure all of our hardware was on the HCL.
This proved useless, as there seems to be an issue with the way VSAN's v2 disk format removes and relocates data. If we right-click a VM within vCenter and select Delete from Disk, the data is never really removed, and thus "Other" files are born. If we right-click a VM and select Move To, the data never really moves and the size of "Other" files increases. The engineers confirmed my networking and switch setup were correct, so no networking issues there, and an audit of our setup only flagged driver updates for the NIC and PERC, which I am going to apply today.
I believe VMware is aware of an issue within the VSAN software and is working to find a resolution. Unfortunately, we paid the price in data loss. I am still working with VMware today and will let you know how the vsanDatastore removal goes, as we are going to attempt this again. VMware has had a great track record from my perspective, and as a VMware user/sysadmin/enthusiast for 10 years, I feel this is one product that should have been tested further before being released to GA. I will keep the thread updated as much as possible, but we are in crisis/DR mode here, so don't be surprised if it is a few days before I update.
I am the architect for VSAN and something doesn't add up in what you state here (or what the engineer you were working with explained back). I am trying to find out the SR number so I can find out what exactly happened and make sure it gets addressed correctly. The behavior you describe would certainly be incorrect, but believe me the scenario you describe is a) tested and b) not fundamentally broken with the v2 disk format. I will get to the bottom of this. I am also available for a call. Please drop my name on the case while I try to find out the case number from my end as well.
How do I forward my case to you without putting my SR# in the forums, or can you PM me your number to call? I would love to have another engineer involved. Also, Katrena Chase is the manager I have been in contact with; perhaps contacting her would get you my contact info. I know from a forum perspective it's sometimes hard to wrap your head around something that doesn't seem logical, especially since the product was not designed to behave in such a manner. Also, don't get me wrong: I am not trying to "bash" VMware, and I do believe this product has great potential (yes, even after we experienced data loss); I am simply making my experience known.
I see that in my inbox, but when I click on it I get:
Little something for anybody else...
I also experienced similar findings when deleting VMs/vApps from vCloud Director 5.5.4 that resided on a VSAN 6.0 cluster/datastore: files appeared to remain in the VSAN datastore browser and could not be deleted.
Issue was resolved by consolidating the Library Copy/vApp Template in vCloud.
The template was originally copied over from another NFS datastore. The shadow VMs were removed from the source datastore, but the destination needed to be consolidated.