First post. Hope I'm doing this right.
I upgraded from VMware Workstation Pro v12 to v15 about 6 months ago. Shortly after doing so I started noticing an issue where, during my normal workflow (fire up a VM, run some tests, revert the VM to a snapshot, fire up the VM again to run more tests), the VM would suddenly no longer be able to boot. Moreover, once in the bad state, reverting to snapshots also failed. Until today I'd had trouble tracking down the chain of events that causes the behavior. I'm usually engrossed in my work (testing and validating bugs and bugfixes), so I'm usually not paying close enough attention to VMware itself to write up what I would consider an adequate bug report -- and I understand how frustrating it is to try to track down a badly defined bug with missing or fuzzy repro steps.
I used v12 with this flow for YEARS and never ran into this issue, and since upgrading I run into it at least a couple of times a month, so it would appear to be a regression somewhere between v12 and v15.5. In case it may be a factor, both my host and guest OS (in this case) are Windows 10 x64, but I've also experienced the issue with various other Windows VMs. For what it is worth, I have a colleague who has the same workflow as I do -- and he has experienced the same intermittent issue. He's only ever had Workstation v15.5 and is running on completely different hardware than mine, so the issue doesn't appear to have anything to do with my having upgraded Workstation previously or with my specific system configuration.
This has mostly just been a nuisance because I can usually just restore from a backup, but today I made some lengthy changes to my test environment and had not yet backed them up. Then the issue happened:
Rather than spending hours trying to restore from a backup and repeat my work, I decided to try to recover. The total size of the VM hadn't changed (it was in the ballpark of 48GB before and after the snapshot restore), so it was unlikely that any data had actually been lost. Checking the VM's folder, I confirmed that the 000004 file did not, in fact, exist. What was strange was that it appeared there *shouldn't* be a 4th vmdk diff file, since the VM only has 3 snapshots. I checked the vmx files for a couple of other VMs that also had snapshots and realized that the only line that seemed suspect in my broken VM was: scsi0:0.fileName = "PAR8500 VMWare Converted-000004.vmdk". I changed the 4 to a 3, saved, relaunched the GUI, and my VM was able to boot and move between snapshots again. I'd figured out the breaking point in the failure and how to recover from it.
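For anyone hitting the same thing, the diagnosis above can be checked mechanically. Here's a rough Python sketch (the function name and approach are mine, not anything shipped with Workstation) that pulls the scsiN:M.fileName entries out of .vmx text and flags any that don't exist in the VM's folder listing:

```python
import re

def check_vmx_disks(vmx_text, files_in_folder):
    """Check each scsiN:M.fileName entry in .vmx text against the VM folder.

    Returns a list of (referenced filename, exists?) pairs. Hypothetical
    diagnostic helper only -- not part of Workstation.
    """
    refs = re.findall(r'scsi\d+:\d+\.fileName\s*=\s*"([^"]+)"', vmx_text)
    present = set(files_in_folder)
    return [(name, name in present) for name in refs]
```

Feeding it the broken line and the folder's actual contents flags the dangling reference -- e.g. `check_vmx_disks('scsi0:0.fileName = "Disk-000004.vmdk"', ["Disk-000003.vmdk"])` returns `[("Disk-000004.vmdk", False)]`, telling you which index to correct.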
Steps leading up to the error:
I'm pretty sure that step 3 in that flow is what triggers the issue, and timing is likely important. Recalling previous failures with hindsight, I know that changing other VM properties (not RAM) can also cause the failure. I've tried to repro the issue and have not yet been able to. My theory is that SOMETHING about restoring a snapshot and then immediately attempting to alter the VM properties is causing the UI to write the wrong value for the .vmdk that the machine should be referencing. To the best of my knowledge, the issue is always n+1, but I cannot confirm.
I hope this can be resolved in the not too distant future -- not for me since I now have a workaround -- but for anyone else experiencing the issue who doesn't know what is going on and might be losing valuable work.
Steps leading up to the error:
Question about step 2 ...
How do you "restore an Offline Snapshot"
Between #2 and #3 the VM will be powered off. So are you sure you waited long enough for the UI to really finish powering off?
Starting with a running VM, I open the Snapshot manager and revert to a snapshot of the VM in a "Powered Off" state.
I think this specific flow is part of what sets up the issue to happen because the UI is doing several things all at once.
In trying to make a screencap of my workflow I inadvertently reproduced the issue:
the vmdk linked to my snapshot was previously "000003" but somehow when snapping back and then immediately altering the VM's specs, the file was renamed to "000004"...? This doesn't quite line up with my theory, but at least shows a repro. I'll see if I can get a better recording that also shows the folder contents changing while this is all happening. This doesn't happen when *just* snapping back to the earlier snapshot.
I've done the same series of steps with both the folder and the vmx file open about 5 times now and haven't been able to get the issue to repro. What I *have* noticed is that each time the snapshot is reverted, the vmdk file seems to alternate between "000004" and "000003". When it works successfully it is because the filename and the value in the .vmx file match. When there is a problem it is because the two values are out of sync -- seemingly because the vmx did not get updated with the new value.
The crux of the issue seems to be the UI needing to update the .vmx file with this alternating index at the end of the vmdk filename and then also attempting to update a second value in the .vmx file. Somehow, when making the 2nd update, it loses track of the alternating index on the .vmdk, writes the old index value instead, and BAM the vmx and vmdk are out of sync and the VM is now in a bad state.
That's really insightful and I think does a good job explaining at least part of the mechanism behind what's going on. Thanks for finding that and linking it!
In regards to the linked issue... yeah it makes sense that it'd force queue multiple snapshot-saves since a previous one still had to be written out. There would be nothing you'd be able to do about that.
In my scenario there was no snapshot saving, just a snapshot restore. What seems to be happening in my scenario is that during the snapshot revert, the UI writes the new vmdk name to the vmx. Then, when the VM spec is changed, that change is written using a stale (cached?) version of the vmx file.
Another theory on the mechanism at play:
When reverting the snapshot the Snapshot UI (in 1 thread) updates the vmx with the new vmdk index.
Then the main UI, with a cached version of the vmx file (?), in another thread makes a second write to the vmx file, effectively undoing the change made by the Snapshot UI thread.
Since the issue only seems to happen intermittently, it would appear that there IS a mechanism attempting to synchronize these events, but something about its timing isn't completely dialed in, and every once in a while it fails, allowing the issue to occur.
With these assumptions about how the UI functions, this is what I think would need to be changed to prevent this issue from happening:
When the UI is making new writes to the vmx file, it needs to make sure that there are no pending writes (from another thread?) and/or that it is not making its edits to a stale version of the vmx.
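The theory above can be sketched as a classic lost update. This is a deterministic Python illustration of my own (NOT VMware's actual code -- the structure and names are assumptions): two writers each work from their own cached copy of the "vmx", and the second save silently undoes the first writer's change.

```python
def save_vmx(view):
    """Simulate writing a whole cached copy of the .vmx back to disk."""
    return dict(view)

# Shared ".vmx file" contents before the revert
vmx = {"scsi0:0.fileName": "disk-000003.vmdk", "memsize": "2048"}

# Both threads read (cache) the file in its current state
snapshot_view = dict(vmx)   # snapshot-revert thread's copy
settings_view = dict(vmx)   # settings-UI thread's copy (stale from here on)

# 1) Snapshot revert updates the disk index and saves the whole file
snapshot_view["scsi0:0.fileName"] = "disk-000004.vmdk"
vmx = save_vmx(snapshot_view)

# 2) Settings UI saves its RAM change from the stale cached copy,
#    clobbering the disk-index update made in step 1
settings_view["memsize"] = "4096"
vmx = save_vmx(settings_view)

# Result: the .vmx points at -000003 while the actual diff file on disk
# is now -000004 -- exactly the vmx/vmdk mismatch described above
```

The fix suggested above amounts to re-reading the file (or merging against it) under a lock before each save, instead of writing out a whole stale cached copy.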
Thanks for the diligent investigation and the detailed defect report. (And a very calm and composed forum post, under the circumstances... it's a shocker of a defect you've found, and you'd have every reason to be rather upset about it.)
I have filed an internal problem report for this issue, so our UI engineers should investigate this soon, and I would expect that we will want to have this fixed promptly because of the potential for corrupting the .vmx file (even though it is recoverable with some manual editing, as you have found).
I have exactly the same bug with VMware Workstation 15.5.2 on 2 different computers:
- 1st: Intel Core i7 4790K, 16 GB RAM, 4 different SSDs, Windows 10 Enterprise LTSC 2019.
- 2nd: Intel Core i5 7500, 16 GB RAM, 1 SSD, Windows 10 Pro (version 1909).
Very often, when I revert to a snapshot, I can't start my VM again because it has become corrupted (missing .vmdk).
Now I have to make many manual backups of my VMs, because I have already lost a lot of VMs (test VMs, fortunately).
Thank you for your help.
Hi MrCommunistGen and freebeing,
Workstation 15.5.5 is being released right now and I just figured I would give you a heads-up that this issue is not resolved in this update. You'll have to stay tuned for future updates, I'm afraid.
Thanks again for your excellent problem reports and your patience.
Thanks for following up on the issue last week and keeping me apprised of the situation. Even though it hasn't yet been addressed, I'm pleased that this has garnered official attention from the VMware organization.
I try to be reasonable and detailed in any issues I report because as part of my job I work on both sides of our product. I work with our internal service teams in identifying, reproducing, and documenting bugs. I also work closely with our engineering team to make sure that issues are properly understood and that proposed fixes behave in an acceptable manner.
Using a bug report to rant or yell at people doesn't help accomplish the ultimate goal of getting the issue fixed and serves no purpose.
Have you tried editing the .vmx file after you experienced corruption?
You might be able to "fix" your VMs without having to restore from backup. At least in my situation, the vmdk index in the .vmx seems to be off by +/- 1. Since finding this, it has saved me a LOT of time copying my VMs back from backup. (I'd still recommend making backups, just in case some other issue such as disk failure occurs.)
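To make the +/- 1 edit less error-prone, here's a small Python sketch (the helper name is mine; this is not a VMware tool) that bumps the index in a fileName line. As always: power off the VM and back up the .vmx before editing it.

```python
import re

def toggle_disk_index(vmx_line, delta):
    """Bump the -0000NN index in a .vmx fileName line by delta (+1 or -1).

    Hypothetical helper for the manual +/- 1 fix described above; the
    correct index is whichever -0000NN.vmdk actually exists in the folder.
    """
    def bump(match):
        return "-%06d.vmdk" % (int(match.group(1)) + delta)
    return re.sub(r'-(\d{6})\.vmdk', bump, vmx_line)
```

For example, `toggle_disk_index('scsi0:0.fileName = "Disk-000004.vmdk"', -1)` rewrites the line to point at `Disk-000003.vmdk`.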
I'm surprised by the VMware Workstation 15.5.5 update, because today I ran many tests without hitting the bug.
So maybe the bug only appears under specific circumstances (perhaps settings I didn't use today)?
I am on VMware 15.5.6 and experience this on a regular basis. I never experienced this when running VMware 14. The only difference in my setup between 14 and 15 is that my host machine on 14 was Windows 7, and my host machine with 15 is Windows 10 (Enterprise edition, version 1809).
Host machine specifics:
Windows 10 Enterprise edition (1809)
Intel Xeon Gold 5120 @2.20GHz (2 processors)
C drive on 2 TB SSD
D drive (VM folders) 2 TB SSD
E drive (VM folders) 4 TB HDD
F drive (shared folders) 2 TB USB 3.0 My Passport drive
In my scenarios, I have found that I don't even need to try to modify the VM properties. Sometimes if I revert to a powered off snapshot while the VM is powered on, it will break it and say the disk is missing when I try to power on the VM once the snapshot is restored. And then every snapshot will complain that the disk is missing.
This ONLY happens if I revert to a snapshot while the VM is running. (Note all of my snapshots are taken with the VM powered off). If I shut down the VM before switching snapshots it works fine every time. This is a real pain to have to shut down my VMs to get to a different snapshot all the time though.
The only way I have been able to fix the issue when it happens is to remove the disk from the VM and then revert to the snapshot again. For some reason, after removing the disk, reverting to a snapshot will recreate the correct disk for the snapshot.
I really hope VMware can get this fixed.
Looks like the problem is still there in WS 16.1!
I had the same problem with a VM with multiple snapshots (all in powered-off states)
Empty --> Windows in Audit Mode --> Completed
Then going back to "Empty" and changing the DVD mount information from the VM settings page.
The VM then says it cannot find the correct VMDK and all is lost. (I must say I didn't try to manually edit the VMX file, though.)
But if I power on the VM THEN change the mount option for the DVD, it works.