9 Replies Latest reply on May 29, 2020 9:49 AM by MrCommunistGen

    Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot

    MrCommunistGen Lurker

      First post.  Hope I'm doing this right.

       

      I upgraded from VMWare Workstation Pro v12 to v15 about 6 months ago.  Shortly after doing so I started noticing an issue where, during my normal workflow (fire up a VM, run some tests, revert the VM to a Snapshot, fire up the VM again to run more tests), the VM would suddenly no longer be able to boot.  Moreover, once in the bad state, reverting to snapshots also failed.  Until today I'd had trouble tracking down the chain of events that causes the behavior.  I'm usually engrossed in my work (testing and validating bugs and bugfixes) so I'm usually not paying close enough attention to VMWare itself to write up what I would consider an adequate bug report -- and I understand how frustrating it is to try to track down a badly defined bug with missing or fuzzy repro steps.

       

      I used v12 with this flow for YEARS and never ran into this issue, and since upgrading I run into it at least a couple times a month, so it would appear to be a regression somewhere between v12 and v15.5.  In case it may be a factor, both my host and guest OS (in this case) are Windows 10 x64, but I've also experienced the issue with various other Windows VMs.  For what it is worth, I have a colleague who has the same workflow as I do -- and he has experienced the same intermittent issue as I have.  He's only ever had Workstation v15.5, and is running on completely different hardware than mine, so the issue doesn't appear to have anything to do with my having upgraded Workstation previously or to my specific system configuration. 

       

      This has mostly just been a nuisance because I can usually just restore from a backup, but today I made some lengthy changes to my test environment and had not yet backed them up.  Then the issue happened:

       

      [Attached screenshot: VMWare error.png]

       

      Rather than spending hours trying to restore from a backup and repeat my work I decided to try to recover.  The total size of the VM hadn't changed (it was in the ballpark of 48GB before and after the snapshot restore) so it was unlikely that any data had actually been lost.  Checking the VM's folder I could confirm that the 000004 file did not in fact exist.  What was strange was that it appeared that there *shouldn't* be a 4th vmdk diff file at all, since the VM only has 3 snapshots.  I checked a couple of other vmx files for other VMs that also had snapshots and realized that the only line that seemed to be suspect in my broken VM was: scsi0:0.fileName = "PAR8500 VMWare Converted-000004.vmdk".  I changed the 4 to a 3, saved, relaunched the GUI, and my VM was able to boot and move between snapshots again.  I'd figured out the breaking point in the failure and how to recover from it.
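
      For anyone who would rather script that manual edit, here is a minimal Python sketch of the same change.  It only rewrites the quoted value on one "key = value" line of the .vmx text.  The function name set_vmx_value is made up for illustration, and the sketch assumes the key appears exactly once in the file and that Workstation is closed and the .vmx has been backed up before the result is saved.

```python
import re


def set_vmx_value(vmx_text: str, key: str, new_value: str) -> str:
    """Return vmx_text with the quoted value on the `key = "..."` line replaced.

    Hypothetical helper: it does a plain text substitution and knows nothing
    about what VMware expects the value to be.
    """
    pattern = re.compile(
        r'^([ \t]*' + re.escape(key) + r'[ \t]*=[ \t]*)"[^"]*"',
        re.MULTILINE,
    )
    # A function is used as the replacement so that backslashes in
    # new_value (e.g. Windows paths) are inserted literally.
    updated, count = pattern.subn(
        lambda m: m.group(1) + '"' + new_value + '"', vmx_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one {key!r} line, found {count}")
    return updated


if __name__ == "__main__":
    broken = (
        'memsize = "4096"\n'
        'scsi0:0.fileName = "PAR8500 VMWare Converted-000004.vmdk"\n'
    )
    print(set_vmx_value(
        broken, "scsi0:0.fileName", "PAR8500 VMWare Converted-000003.vmdk"
    ))
```

      Raising an error when the key is missing or duplicated is deliberate: a silent no-op would look like a successful repair.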

       

      Steps leading up to the error:

      1. VM Running
      2. Restore Offline Snapshot
      3. Immediately after the UI says that the snapshot has been restored, change the RAM allotted to the VM (I'd mistakenly saved my offline snapshot with too little RAM).
      4. Attempt to boot VM -> Error

       

      I'm pretty sure that step 3 in that flow is what triggers the issue, and timing is likely important.  Recalling previous failures with hindsight, I know that changing other VM properties (not RAM) can also cause the failure.  I've tried to repro the issue and have not yet been able to.  My theory is that SOMETHING about restoring a Snapshot and then immediately attempting to alter the VM properties is causing the UI to write the wrong value for the .vmdk that the machine should be referencing.  To the best of my knowledge, the wrong value is always n+1 (one higher than the index of the file that actually exists), but I cannot confirm.

       

      I hope this can be resolved in the not too distant future -- not for me since I now have a workaround -- but for anyone else experiencing the issue who doesn't know what is going on and might be losing valuable work.

       

      Cheers!

        • 1. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
          continuum Guru
          vExpert, Community Warriors

          Steps leading up to the error:

          1. VM Running
          2. Restore Offline Snapshot
          3. Immediately after the UI says that the snapshot has been restored, change the RAM allotted to the VM (I'd mistakenly saved my offline snapshot with too little RAM).
          4. Attempt to boot VM -> Error

           

          Question about step 2 ...

          How do you "restore an Offline Snapshot"?

          Between #2 and #3 the VM will be powered off. So are you sure you waited long enough for the UI to really be done powering off?

          • 2. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
            MrCommunistGen Lurker

            Starting with a running VM, I open the Snapshot manager and revert to a snapshot of the VM in a "Powered Off" state.

             

            I think this specific flow is part of what sets up the issue to happen because the UI is doing several things all at once.

             

            In trying to make a screencap of my workflow I inadvertently reproduced the issue:

             

            The vmdk linked to my snapshot was previously "000003" but somehow, when snapping back and then immediately altering the VM's specs, the file was renamed to "000004"...?  This doesn't quite line up with my theory, but at least shows a repro.  I'll see if I can get a better recording that also shows the folder contents changing while this is all happening.  This doesn't happen when *just* snapping back to the earlier snapshot.

             

            Update:
            I've done the same series of steps with both the folder and the vmx file open about 5 times now and haven't been able to get the issue to repro.  What I *have* noticed is that each time the snapshot is reverted, the vmdk file seems to alternate between "000004" and "000003".  When it works successfully it is because the filename and the value in the .vmx file match.  When there is a problem it is because the two values are out of sync -- seemingly because the vmx did not get updated with the new value.

             

            The crux of the issue seems to be the UI needing to update the .vmx file with this alternating index at the end of the vmdk filename and then also attempting to update a second value in the .vmx file.  Somehow, when making the 2nd update, it loses track of the alternating index on the .vmdk, writes the old index value instead, and BAM the vmx and vmdk are out of sync and the VM is now in a bad state.
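
            The "out of sync" condition described above can be checked mechanically.  Below is a minimal Python sketch that reads a .vmx, collects every "<device>.fileName" line that points at a .vmdk, and reports the ones whose file is not on disk.  The function name missing_disks is made up for illustration, and the sketch assumes relative disk names live in the same folder as the .vmx, which is how my VMs are laid out.

```python
import re
from pathlib import Path

# Matches lines such as: scsi0:0.fileName = "Some VM-000004.vmdk"
FILENAME_LINE = re.compile(
    r'^[ \t]*([\w:.]+\.fileName)[ \t]*=[ \t]*"([^"]*\.vmdk)"',
    re.MULTILINE | re.IGNORECASE,
)


def missing_disks(vmx_path):
    """Return (key, disk_name) pairs whose .vmdk is referenced but not on disk."""
    vmx_path = Path(vmx_path)
    # errors="replace" keeps the scan going if the file is not valid UTF-8.
    text = vmx_path.read_text(encoding="utf-8", errors="replace")
    missing = []
    for key, name in FILENAME_LINE.findall(text):
        disk = Path(name)
        if not disk.is_absolute():
            # Assumption: relative names are resolved next to the .vmx.
            disk = vmx_path.parent / disk
        if not disk.exists():
            missing.append((key, name))
    return missing
```

            Running this right after a revert, before powering the VM on, would show whether the .vmx and the folder contents still agree.  An empty list means every referenced disk was found.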

            • 4. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
              MrCommunistGen Lurker

              That's really insightful and I think does a good job explaining at least part of the mechanism behind what's going on.  Thanks for finding that and linking it!

               

              In regards to the linked issue... yeah it makes sense that it'd force queue multiple snapshot-saves since a previous one still had to be written out.  There would be nothing you'd be able to do about that.

               

              In my scenario there was no snapshot saving, just a snapshot restore.  What seems to be happening in my scenario is that during the snapshot revert, the UI writes the new vmdk name to the vmx.  Then, when the VM spec is changed, that change is written out using a stale (cached?) version of the vmx file.

               

              Another theory about the mechanism at play:

              When reverting the snapshot the Snapshot UI (in 1 thread) updates the vmx with the new vmdk index.

              Then the main UI, with a cached version of the vmx file (?), in another thread makes a second write to the vmx file, effectively undoing the change made by the Snapshot UI thread.

               

              Since the issue only seems to happen intermittently - it would appear that there IS a mechanism that is attempting to synchronize these events, but something about its timing isn't completely dialed in, and every once in a while it fails, allowing the issue to occur.

               

              With these assumptions about how the UI functions, this is what I think would need to be changed to prevent this issue from happening:

              When the UI is making new writes to the vmx file, it needs to make sure that there are no pending writes (from another thread?) and/or that it is not making its edits to a stale version of the vmx.
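
              To illustrate the theory (and only the theory: I have no visibility into how the Workstation UI is actually implemented), here is a small Python model of the "lost update" pattern.  The VmxStore class, the disk-00000N.vmdk names, and both demo functions are invented for this sketch.  The first demo writes back a cached copy of the whole file and silently undoes the revert; the second applies the change to the current contents under a lock.

```python
import threading


class VmxStore:
    """In-memory stand-in for a .vmx file: a dict of key -> value."""

    def __init__(self, values):
        self._values = dict(values)
        self._lock = threading.Lock()

    def read_all(self):
        with self._lock:
            return dict(self._values)

    def write_all(self, values):
        # Whole-file overwrite, as a writer holding a cached copy would do.
        with self._lock:
            self._values = dict(values)

    def update(self, key, value):
        # Read-modify-write under one lock: only the named key changes,
        # and it is applied to the current contents, never a cached copy.
        with self._lock:
            self._values[key] = value


def stale_write_demo():
    store = VmxStore({"scsi0:0.fileName": "disk-000003.vmdk", "memsize": "2048"})
    cached = store.read_all()                              # main UI caches the file
    store.update("scsi0:0.fileName", "disk-000004.vmdk")   # snapshot revert
    cached["memsize"] = "4096"                             # user changes RAM
    store.write_all(cached)                                # stale copy wins
    return store.read_all()


def fresh_write_demo():
    store = VmxStore({"scsi0:0.fileName": "disk-000003.vmdk", "memsize": "2048"})
    store.update("scsi0:0.fileName", "disk-000004.vmdk")   # snapshot revert
    store.update("memsize", "4096")                        # RAM change, no cache
    return store.read_all()


if __name__ == "__main__":
    print("stale:", stale_write_demo())   # fileName is back to disk-000003.vmdk
    print("fresh:", fresh_write_demo())   # fileName stays disk-000004.vmdk
```

              The demos run sequentially on purpose so the outcome is deterministic.  In the real UI the ordering would depend on timing, which would fit with the failure being intermittent.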

              • 5. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
                dariusd Virtuoso
                VMware Employees, User Moderators

                Thanks for the diligent investigation and the detailed defect report.  (And a very calm and composed forum post, under the circumstances... it's a shocker of a defect you've found, and you'd have every reason to be rather upset about it.)

                 

                I have filed an internal problem report for this issue, so our UI engineers should investigate this soon, and I would expect that we will want to have this fixed promptly because of the potential for corrupting the .vmx file (even though it is recoverable with some manual editing, as you have found).

                 

                Thanks,

                --

                Darius

                • 6. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
                  Maxime THEPAULT Novice

                  Hi Darius,

                   

                  I have exactly the same bug with VMware Workstation 15.5.2 on 2 different computers:

                  - 1st: Intel Core i7 4790K, 16 GB RAM, 4 different SSDs, Windows 10 Enterprise LTSC 2019.

                  - 2nd: Intel Core i5 7500, 16 GB RAM, 1 SSD, Windows 10 Pro (version 1909).

                  Very often, when I revert to a snapshot, I can't start my VM again because it has become corrupted (missing .vmdk).

                  Now, I have to make many manual backups of my VMs because I have already lost a lot of VMs (test VMs, fortunately).

                   

                  Thank you for your help.

                  • 7. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
                    dariusd Virtuoso
                    User Moderators, VMware Employees

                    Hi MrCommunistGen and freebeing,

                     

                    Workstation 15.5.5 is being released right now and I just figured I would give you a heads-up that this issue is not resolved in this update.  You'll have to stay tuned for future updates, I'm afraid.

                     

                    Thanks again for your excellent problem reports and your patience.

                    --

                    Darius

                    • 8. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
                      MrCommunistGen Lurker

                      dariusd

                      Thanks for following up on the issue last week and keeping me apprised of the situation.  Even though it hasn't yet been addressed, I'm pleased that this has garnered official attention from the VMWare organization.

                       

                      I try to be reasonable and detailed in any issues I report because as part of my job I work on both sides of our product.  I work with our internal service teams in identifying, reproducing, and documenting bugs.  I also work closely with our engineering team to make sure that issues are properly understood and that proposed fixes behave in an acceptable manner.

                       

                      Using a bug report to rant or yell at people doesn't help accomplish the ultimate goal of getting the issue fixed and serves no purpose.

                      • 9. Re: Bug: VMWare WS v15.5 UI can corrupt .vmx after reverting from Snapshot
                        MrCommunistGen Lurker

                        freebeing

                         

                        Have you tried editing the .vmx file after you experienced corruption?

                         

                        You might be able to "fix" your VMs without having to restore from backup.  At least in my situation, the vmdk line in the .vmx seems to be off +/- 1.  Since finding this, it has saved me a LOT of time having to copy my VMs from backup.  (I'd still recommend making backups just in case some other issue such as disk failure occurs).
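
                        If it helps, here is a minimal Python sketch for finding the likely replacement name.  Given the disk name the .vmx currently references, it looks in the VM folder for a file whose index is one lower or one higher.  The function name neighbor_candidates is made up, and the pattern assumes the six-digit "-00000N.vmdk" suffix seen on my snapshot disks.  It only suggests names; you would still edit the .vmx yourself.

```python
import re
from pathlib import Path

# Assumption: snapshot disks end in a six-digit index, e.g. "-000004.vmdk".
INDEX = re.compile(r'-(\d{6})\.vmdk$', re.IGNORECASE)


def neighbor_candidates(vm_dir, referenced_name):
    """Return existing disk names whose index is referenced_name's index -1 or +1."""
    match = INDEX.search(referenced_name)
    if match is None:
        return []
    index = int(match.group(1))
    found = []
    for delta in (-1, 1):
        neighbor = index + delta
        if neighbor < 1:
            continue
        candidate = (
            referenced_name[:match.start(1)]
            + f"{neighbor:06d}"
            + referenced_name[match.end(1):]
        )
        if (Path(vm_dir) / candidate).exists():
            found.append(candidate)
    return found
```

                        If this returns exactly one name, that is probably the value the vmdk line should have.  If it returns none or two, I would stop and look at the folder by hand rather than guess.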

                         

                        Good Luck!