dom124214
Contributor

hgfs file corruption when using different file handles to the same file in same process, different threads

Hello all,

I'm encountering a problem when using shared folders where files that are written to by a single process are being corrupted if another thread in that same process merely reads from the same file under a separately opened file handle.

This is occurring under VMware Fusion 4.1.4 on Mac OS X 10.7.5, using CentOS 5.8 as the guest OS. The corrupted files end up with blocks of zero bytes that overwrite a portion of their data. The size of the blocks does not generally match the length of the missing data.

I have attached the source code to a simple C++ test program that exhibits the problem fairly consistently. A Makefile is included to build the executable.

Usually it hits the bad case on the second or third attempt when it writes out to a shared folder. It doesn't hit the bad case at all when it's outputting to the local file-system. The test program returns non-zero if it reproduces the error, so a simple shell loop can be used to continually run the program until the bad case is hit.

The program has two threads. Each thread has a separately opened file handle to the same file. The first thread opens the two file handles. It first creates a handle to a file for writing, and then it opens a read handle to read back from the file being written out to.

It then sets up the second thread, which is given a file handle for writing, and this second thread writes ever-increasing consecutive integers to the file until it is signalled to stop. After writing out a single unsigned integer, it flushes the file.
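For readers without the attachment, the writer loop described above can be sketched roughly as follows. This is not the attached program: the function name `write_consecutive` and the `limit` parameter (added only so the sketch terminates on its own) are assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>

// Hypothetical sketch of the writer loop, not the attached program.
// Writes 0, 1, 2, ... as raw unsigned integers to `out`, calling fflush()
// after every integer, until `stop` is set or `limit` integers are written.
// Returns how many integers were written.
unsigned write_consecutive(std::FILE* out, const std::atomic<bool>& stop,
                           unsigned limit) {
    assert(out != nullptr);
    unsigned next = 0;
    while (!stop.load() && next < limit) {
        if (std::fwrite(&next, sizeof next, 1, out) != 1) {
            break;  // write error: stop early
        }
        std::fflush(out);  // flush after each integer, as described above
        ++next;
    }
    return next;
}
```

In the real test this loop would run on the second thread, with the first thread setting `stop` on SIGINT or when the time limit expires.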

The first thread will continually use its read file handle to seek around the file to randomly chosen 64 KB boundaries, and attempt to read 64 KB. This mimics behaviour in our development system where we first encountered the corruption. The first thread performs no writes of its own to the write file handle after calling fopen(), and it only calls fseek() and fread() on the read file handle, as well as stat() on the filename. It's unknown whether the 64 KB boundary reads are significant to reproducing the problem.
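A minimal sketch of one iteration of that reader loop, assuming the file size has already been obtained from stat() on the filename. The helper names `random_block_offset` and `read_block_at` are hypothetical and do not come from the attached program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

constexpr long kBlockSize = 64 * 1024;  // 64 KB

// Hypothetical helper: pick a random 64 KB-aligned offset no larger than
// `file_size` (which the caller would obtain from stat() on the filename).
long random_block_offset(long file_size, std::mt19937& rng) {
    assert(file_size >= 0);
    std::uniform_int_distribution<long> pick(0, file_size / kBlockSize);
    return pick(rng) * kBlockSize;
}

// Hypothetical helper: seek the read handle to `offset` and attempt to read
// one 64 KB block. Returns the number of bytes read (possibly fewer than
// 64 KB near the end of the file), or -1 if the seek fails.
long read_block_at(std::FILE* in, long offset, std::vector<char>& buf) {
    assert(in != nullptr);
    if (std::fseek(in, offset, SEEK_SET) != 0) {
        return -1;
    }
    buf.resize(static_cast<std::size_t>(kBlockSize));
    return static_cast<long>(std::fread(buf.data(), 1, buf.size(), in));
}
```

The data read is simply discarded; the point is only that reads are issued through a handle separate from the one being written to.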

If SIGINT is received, or a time limit was given on the command line, the second thread is signalled to stop after writing out any current integer and a subsequent fflush(). The first thread waits on the second thread, and then attempts to verify the output file, reading unsigned integers and checking that they are consecutively numbered from 0. If there's a mismatch, and it's a 0, then more integers are read from the file until we have the full block of zeros. The file position and the number of zeroes (and the number of bytes) are then output to stdout.
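The verification pass can be sketched along these lines. The `ZeroRun` struct and `find_zero_runs` function are hypothetical names, and as a simplification this version collects only zero-valued mismatches and ignores any other kind.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One contiguous block of unexpected zero-valued integers.
struct ZeroRun {
    long offset;  // byte offset in the file where the run starts
    long count;   // number of zero integers in the run
};

// Hypothetical sketch of the verification pass, not the attached program.
// Reads unsigned integers from the start of `in`; the integer at index i is
// expected to equal i. Simplification: only zero-valued mismatches are
// collected; any other mismatch is ignored here.
std::vector<ZeroRun> find_zero_runs(std::FILE* in) {
    assert(in != nullptr);
    std::vector<ZeroRun> runs;
    std::rewind(in);
    unsigned value = 0;
    unsigned index = 0;
    bool in_run = false;
    while (std::fread(&value, sizeof value, 1, in) == 1) {
        // Index 0 legitimately holds the value 0, so it is never a mismatch.
        const bool unexpected_zero = (value == 0 && index != 0);
        if (unexpected_zero && !in_run) {
            runs.push_back({static_cast<long>(index * sizeof(unsigned)), 0});
        }
        if (unexpected_zero) {
            ++runs.back().count;
        }
        in_run = unexpected_zero;
        ++index;
    }
    return runs;
}
```

The size of each block in bytes is `count * sizeof(unsigned)`, and a non-empty result would correspond to the non-zero exit status mentioned earlier.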

Any assistance, or even just confirmation of the bug, would be greatly appreciated. It would be nice to know whether similar problems have been encountered in the past, and whether it's likely to be fixed in the near future. Any further questions, or things for me to try, please feel free to ask.

Cheers and thanks,

Dominic

45 Replies
admin
Immortal

What I can do here is forward this issue to the Fusion team.

Could you please collect the detailed information via VMware Fusion (menu bar) > Help > Collect Support Information and then post the .tgz file here?

dom124214
Contributor

Hi dakangz,

Please find the support information attached.

Thank you for looking into this for us.

Cheers,

Dominic

admin
Immortal

Thanks, Dominic.

We got it and hope we can resolve this issue ASAP.

dom124214
Contributor

Thank you for looking into it.

I've just upgraded to VMware Fusion 5 Pro, and I can reproduce the problem, this time using Ubuntu 12.10 as the guest OS.

I've attached the new support information tarball.

Hope it helps.

Cheers,

Dominic

dom124214
Contributor

Hi dakangz,

May I ask if there's been any progress? Even just knowing that the bug has been accepted would be good. Also, if there's any other assistance I can give, please let me know.

Cheers,

Dominic

steve_goddard
VMware Employee

Hi Dominic,

I am the developer for the Shared Folders feature, and I have been under the gun recently with lots of new stuff going on.

I have to report that the bug you have reported has been filed and accepted.

I have not had a chance to reproduce this issue and look into it just yet, but you have my word that I will do so as soon as I get a chance.

We take issues like this very seriously and I hope that we can get a fix into the next set of releases (major and minor).

If I have any updates or issues, I will report back here.

Thanks so much for the detailed report and the sample application; this goes a long way toward helping.

Steve

dom124214
Contributor

Hi Steve,

Thanks for letting us know! We'll look forward to when the bug's fixed - if there's anything I can do to help expedite things, please let us know.

Cheers,

Dominic

steve_goddard
VMware Employee

Hi Dominic,

I have rebuilt your test application in Ubuntu VMs and can reproduce this issue.

I have so far discovered that it appears that not all the pages to be written are sent from the client.

It seems that after a long sequence of read requests from the client, the next page sent from the client to be written skips one page, which leaves an unwritten, empty region of 4096 bytes (which will be all zeros).

This also goes to show that this is a client-side issue, as I can reproduce it with Workstation on a Windows host as well as with our Fusion products.
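If a skipped page is the cause, a zero-filled hole would be expected to start on a page boundary and span a whole number of 4096-byte pages. As a rough sketch, a reported hole could be checked against that expectation like this; the function name `looks_like_skipped_pages` is hypothetical, and the 4096-byte page size is taken from the description above rather than queried from the system.

```cpp
#include <cassert>

constexpr long kPageSize = 4096;  // page size mentioned above (assumed fixed)

// Hypothetical check: does a zero-filled region look like one or more whole
// skipped pages, i.e. does it start on a page boundary and span a multiple
// of the page size?
bool looks_like_skipped_pages(long offset, long length) {
    assert(offset >= 0 && length >= 0);
    return length > 0 && offset % kPageSize == 0 && length % kPageSize == 0;
}
```

A hole that fails this check would not by itself rule out the explanation, since more than one issue may be involved.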

Thanks.

Steve

dom124214
Contributor

Hi Steve,

That's great news! Thanks for looking into it. Looking forward to it being fixed, let me know if I can do anything on my end.

Cheers,

Dominic

phinze
Contributor

Hi Steve & Dominic,

Let me know if I should be making my own thread, but this issue looks very similar to what I'm seeing when trying to use a VMware Fusion shared folder as an apt-cache. The deb files as downloaded from the Linux guest and shared to the OS X host have big chunks of zeros cut out of them, which of course causes apt to freak out a little.

See here for the full discovery process: https://github.com/fgrehm/vagrant-cachier/issues/24

I'm psyched to use vagrant with vmware fusion in my work, but file corruption in shared folders is a *major* roadblock. Any movement on this issue in the past few weeks?

Thanks for your time!

Paul

steve_goddard
VMware Employee

Hi Paul,

Thanks for forwarding and reporting your issue. It does seem like the same issue that I am currently working on with Dominic's test application.

I am making progress, but there seem to be a couple of different issues going on.

I hope I will be able to get this addressed and a fix into the next release. I will update again here when I have more to report.

Thanks.

Steve

ziuchkovski
Contributor

Any update on this? I'm currently using the VMware Fusion trial and there's no way I'll purchase when there's a silent data corruption bug like this. I find it disconcerting that a data integrity bug like this is so low on the priority list...

steve_goddard
VMware Employee

ziuchkovski wrote: "Any update on this? I'm currently using the VMware Fusion trial and there's no way I'll purchase when there's a silent data corruption bug like this. I find it disconcerting that a data integrity bug like this is so low on the priority list..."

I am not sure why you would say or assume you know everything about our priority lists. Do you know what the priority list is? Or anything about the internal workings, resource scheduling, outstanding bugs/issues and new features of VMware Fusion, or even just the Shared Folders feature itself? Or even the number of Fusion/WS users who would be affected and run into this issue, weighed against other outstanding issues with, say, non-Linux VMs like Windows?

This bug has been around an awfully long time as it stands, and it is only apparent when you have read and write handles open to the same file at the same time.

(Thus you will not see any corruption issues unless you have some application/environment where you actively do this, as Dominic has. This is also why our testing and quality processes missed this issue in the first place.)

Anyway, for those who do care, this data corruption bug has several underlying issues, and I have fixed some of these and checked the fixes in. However, as is often the case with issues like this, there is no one simple, all-encompassing fix. Likewise, there are some dependencies on which Linux OS version you run, as the VFS layer has changed with every single release. I am currently testing three different Linux kernel versions running in VMs against three different host OSes: Windows 7, OS X 10.8.4 and Ubuntu 12.10.

Furthermore, I have Dominic's test application and it passes all the time. However, beyond that, there is still a random issue going on which is occasionally hit. I am still trying to track this down and fix it. When that is done, I can update here and safely state that this issue has been completely addressed.

This has been my highest priority bug and I have already spent quite a lot of time on it. We treat any data corruption of files as a very serious issue, and none have been discovered with any version of Windows OS VMs. So clearly your assumptions are wrong and ill-conceived.

Steve

ziuchkovski
Contributor

Steve, that's a highly inflammatory response to honest feedback from a potential customer.  I'll put that aside because I'm not interested in turning this into a flame war.

I was concerned by the two month delay between reporting and acceptance, and the five month timeframe in which this has gone unresolved.  However, I am glad to hear that this bug is receiving attention.

steve_goddard
VMware Employee

Please reread your own comment.

I was replying to your incorrect comment and conclusion. I am not starting any flame war, but giving you an honest update and asking why you arrived at the conclusion you did. If you had read my May 29th and June responses, and finally my July 11th reply on this thread, then you would clearly see that it has been given attention, is under investigation, and is being actively fixed as best as possible right now.

Feedback is useful and often very helpful, but to state what our priorities are, and that this bug is low among them, is very presumptuous, to say the least.

Sorry you felt that my response was not correct, but you are expected to read the thread thoroughly and see that there is activity here and that you are not being ignored. I am only interested in one thing: making this feature work with the highest quality I can.

I hope you don't abandon this product, or even this feature, because I have not fixed this on your timeline.


Thanks.

Steve

ziuchkovski
Contributor

Steve, I arrived at my conclusions based on the two-month delay between the reporting of a data corruption bug and your following response:

I am the developer for the Shared Folders feature, and I have been under the gun recently with lots of new stuff going on.

I have to report that the bug you have reported has been filed and accepted.

I have not had a chance to reproduce this issue and look into it just yet, but you have my word that I will do as soon as I get a chance.

I think it's fair for a person to assume that at the point you wrote this, the bug was not a high priority.

dlhotka
Champion

Do you have a paid support contract with VMware, or are you a $39 retail customer? In the absence of the former, perhaps a reset of expectations might be appropriate.

ziuchkovski
Contributor

I'm neither.  I'm evaluating VMware.  This started with me stating that I find it disconcerting that a data corruption bug was triaged as a low priority.  The low priority bit was my assumption based on the thread statements and timing.  Steve replied with a vehement argument that I'm mistaken, so perhaps I am.

In any event, I was merely observing that as a potential customer, it's a major turn-off to run into data corruption.  It's very scary to see any sort of data corruption bug sit open for an extended period of time.

In my case, I'm evaluating VMware Workstation and Fusion for potential use in a dev ops project with Vagrant and Chef.  I will likely recommend we stick with VirtualBox due to this bug.  Will VMware feel any sort of loss from this decision?  Probably not.  However, I'll pass on VMware enterprise offerings for the data center as well.

steve_goddard
VMware Employee

Hi there,

Okay, so here is an example of how it often goes, and did go in this case.

The issue is raised as a bug at some point after being raised on the forum (whether or not someone has responded to it); this can happen immediately, or after a few days or a week.

In this case, the bug was raised after a few days, then completely misfiled, and it ended up in some other unrelated group's queue.

That group's queue then, again by mistake, was not triaged (or the bug was skipped over) for quite some time, so the bug spent most of the two months since it was raised in that queue.

A developer triaging that queue eventually realized the error and it got redirected to the correct group, i.e., me, and I responded here.

Now, that is not the normal sequence of events, but that is what occurred in this case.

It was never intentionally set to any "low" priority of any sort.

At that point I scheduled it into my existing workload of outstanding issues to be fixed, including some urgent, high-priority issues.

With help from the application that Dominic attached, I reproduced the issue and then investigated it, along with another report that sounded like the same issue manifesting itself in a different way.

However, also what can typically happen on these forums is that bugs do get filed but no-one responds to the posts from the VMware developer side.

So no mention of any bug filed or even fixes are relayed back to the forum users.

So please just be aware that even if you don't get replies here, it doesn't mean that the issue is not being addressed or looked at, or that it has been given a low priority.

I know that is not ideal, but it does happen. I have even replied to issues raised on behalf of other developers, and I know other developers have chimed in on issues to do with VMware Shared Folders.

What I will say, is that if you do have VMware developers responding to the thread, and have concerns about priorities and urgency and workarounds that may help, then please just ask first. You will probably get more details back which can often help clarify things for you.

There are an awful lot of points that need to be taken into account when we schedule work and bug fixes. (I cannot enumerate them all here.)

I will also let you know that what a developer expects to be deemed the highest priority is sometimes not what is decided (the VMware developer's own recommendation aside). In worse cases, these decisions are not articulated back to the user either. That being said, I often try to mention to users when this is the case, and why.

Here, corruption bugs are high priority, and rare, as you might expect or indeed hope. In a case where one only occurs in an OS used by very few users, it may be pushed out a little (not totally ignored) if there are equally bad issues in OS platforms used by a very high percentage of users.

You are all free to make all the assumptions you want, but if someone is responding, ask first, so you don't have to assume anything. If there are no comments from VMware people, then that is our fault if you make assumptions and they are wrong.

I hope this helps in at least this case, and I hope it helps when other issues are reported too.

Please always report your issues and please give as much information as you can about reproducing them. It is often very hard to replicate issues that rely on some specific environment, which is usually not at all obvious.

I also appreciate that you are placing so much importance on the VMware Shared Folder issue, as it indicates to me at least, that you do intend to make it a central part of your set up and usage. (Or I could be making the wrong assumptions there..)

Thanks.

Steve

PS I make no assumptions or distinctions about what an individual user's case is in terms of licenses or fees paid, but I do go on how many of our general customers, from our usage data, are likely to be affected by these issues, the nature of the issue, and whether it can be avoided by a sensible workaround or not (along with many other issues our Program Manager raises too).
