VMware Cloud Community
mattjk
Enthusiast

BIG bug in ESX 3.5 Update 2 - If you're using 3.5u2 read this now! - A general system error occurred: Internal Error

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

Hi all,

We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.

The VMware tech support person we spoke to wouldn't 100% confirm whether this was / would be affecting all ESX 3.5u2 installs, but he strongly hinted that it was widespread. For others' sake I hope I'm wrong and it's limited.

The bug:

Starting this morning, we could neither power on nor VMotion any of our Virtual Machines. The VI Client threw the error "A general system error occurred: Internal Error".

Further digging led us to messages like this one in /var/log/vmware/hostd.log, and in the log file for any virtual machine we tried to power on or VMotion:

Aug 12 10:40:10.792: vmx| This product has expired.

Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.

Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".

A call to tech support confirmed this as a known problem with a temporary workaround.
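If you want to check quickly whether a host or VM log shows the same symptom, a rough sketch like the one below may help. It only looks for the "This product has expired." text from the excerpt above. The function names are made up for illustration, and it assumes you run it with Python 3 on a workstation against copies of the logs, not on the Service Console itself.

```python
from pathlib import Path

# Marker text copied from the log excerpt quoted in the post.
EXPIRED_MARKER = "This product has expired."


def find_expiry_lines(log_text):
    """Return the lines of log_text that contain the expiry message."""
    return [line for line in log_text.splitlines() if EXPIRED_MARKER in line]


def scan_logs(paths):
    """Map each readable log path to its matching lines.

    Paths that are missing, unreadable, or have no match are left out.
    (Hypothetical helper, not part of any VMware tooling.)
    """
    hits = {}
    for path in paths:
        try:
            text = Path(path).read_text(errors="replace")
        except OSError:
            continue  # skip logs we cannot open
        matches = find_expiry_lines(text)
        if matches:
            hits[str(path)] = matches
    return hits
```

For example, scan_logs(["hostd.log", "myvm/vmware.log"]) would return a dict keyed by the files that contain the message; the file names here are placeholders for wherever you copied the logs.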

The work-around:

Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.

As soon as the date was reset to the 10th - problem solved.

Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from suspended state) and VMotion.

So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.
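To make the observed cutoff concrete, here is a tiny model of what we saw. It is not VMware's actual check, just the behaviour from our testing written down. The year comes from the date -s example above, the function name is invented, and treating every date after the 12th as affected is an assumption, since we only tested the 10th, 11th and 12th.

```python
from datetime import date

# First clock date on which the post reports failures (observation, not VMware's code).
FIRST_BAD_DATE = date(2008, 8, 12)


def power_on_blocked(host_date):
    """Return True if an unpatched 3.5u2 host with this clock date would,
    per the observations in the post, refuse power-on and VMotion.

    Dates after the 12th are assumed to behave like the 12th.
    """
    return host_date >= FIRST_BAD_DATE
```

This is also why setting the clock back to the 10th works as a stopgap: power_on_blocked(date(2008, 8, 10)) is False.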

There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.

Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!

Cheers,

Matt Kilham

Stratton Car Finance

Message was edited by: JohnTroyer to add new thread links.

Cheers, Matt
704 Replies
akmolloy
Enthusiast

Matt,

This doc: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100672... states that you need to put the server in Maintenance Mode.

-Tony

awbc-au
Contributor

Not sure if I agree with this. ESX has some of the worst error handling I have ever seen. What the hell does a General System Error tell me? Not much.

Yes, I agree. The message to the console is nondescript, which is why so many of us spent a few hours trying to work out the error before we found this thread. I commented on that in a post on the 4th or 5th page. The message I was referring to is the one in the VMware log; there are transcripts of it in the first few pages of the thread as well.

So with that, I will take my comments and concede the point.

No need to, as again I totally agree. Our environment doesn't have as many automated procedures for setting up our ESX hosts as yours, so for us this is a lengthy process. We also install agents like EMC Navisphere (which I know is now bundled with the latest version anyway) so we can manage our LUN mappings from a single place, and there is some complex scripting we created for some custom deployments. So the environment is not geared to just reinstalling the base ESX product and bringing the VMs online. I could do it, but it takes time, and that's time I don't really want to spend on this issue given the patch release time was relatively short. If I did this I would have been up for 24 hours setting everything back up again, only to have the patch released a few hours later.

I don't think it has anything to do with ESXi's free release, which was probably part of u2. I think it's just a bad coincidence.

Sorry, I probably didn't word this correctly. I meant that ESXi had the beta time bomb but ESX didn't. They time-bombed ESXi because it was free, and they do it on almost all the free betas I have seen that you can freely download. Unfortunately, because of this it got rolled up into the production ESX code when it shouldn't have: they rolled the beta code from ESXi into the update patch for both ESXi and ESX, so ESX got the benefit of the time bomb as well... great! At least that is how I understand it to have occurred; I might be wrong.

Tibmeister
Expert

Speedbmp, that is what I did for my first host. You should only have to do it for the first host.

Tibmeister
Expert

WE HAVE VMOTION! I will now re-enable DRS and put my hosts in Maint mode one-by-one and apply the update.

larden
Contributor

I updated with Update Manager (had one "spare" server so I could do maintenance mode). I can power on machines but NOT VMotion.

Operation timed out.

Any ideas anyone?

VMware Rocks!
Tibmeister
Expert

Disable HA until you're all good with everything. I've had HA give me fits doing this type of thing before.

Gonecase
Contributor

Use esxupdate instead of Update Manager to install the patch.

ejward
Expert

I did 1 host. There's no reboot of the host? I've got one eye on the Olympics and one on VC. Could I have missed the reboot?

wwcusa
Contributor

As a VMware Enterprise Partner and VMware Authorized Consultant, I can tell you it IS a big deal for VMware to release a product that has such grave consequences for even a relatively small portion of the total VMware user population. A small percentage does not diminish the severity of the problem for affected users, and the utmost urgency is expected from a company that caters to enterprise customers who don’t have “downtime” in their corporate dictionary anymore.

As said previously, bugs happen. However, I believe this could have been prevented by not rushing an update to market which was intended to be free and compete with Hyper-V. VMware ran face first into the very hurdle it was trying to clear by releasing a free version of its hypervisor to compete with Microsoft’s recent release. This will no doubt teach VMware a lesson and unfortunately will cast doubt about the reliability of VMware in the enterprise. It’s a shame a clearly superior product is going to get bad publicity from this oversight.

Virtualization is here to stay, and VMware has been the leader in this arena for good reason. Let’s give them credit and hope they learn from their mistakes.

Tibmeister
Expert

I did not see the new build# until I rebooted...

awbc-au
Contributor

Hi Guys,

I have successfully fixed the first host using the "Update Manager" process. If you have the Update Manager plugin installed in your VC, you can update as follows:

1. Click on "Update Manager" icon in toolbar

2. Click the "Plugins" menu item and select "Update Manager > Schedule Update Download"

3. Make sure "ESX Server Updates" is selected, click "Next"

4. Change Frequency to "Once" and then select start time of "Now"

5. Click Next > Next > Finish

Watch in the status window at bottom of screen for that to complete and then:

6. Click on "Inventory" icon, click on the impacted host and shut down all the guests

7. Once all guests are shut down put it in maint mode

8. Set the time back to the correct time if you applied the date workaround.

9. Right click on host and select "scan for updates"

10. Watch in the status window and once complete, right click on host and select "Remediate"

11. Make sure critical updates is selected and click next. You should notice here that there is a red cross next to critical updates stating there is one missing.

The server will now patch. Once complete, exit maint mode and start the guests back up. I did not have to reboot.

awbc-au
Contributor

I haven't checked that VMotion is working, but I did see the new version number without rebooting. HTH...

larden
Contributor

My VMotion issue is resolved. It was totally unrelated: someone messed up my VMotion network today - ahh

VMware Rocks!
akmolloy
Enthusiast

To share my experience:

I had a host with VMs on it that could tolerate limited downtime, so I went for it. I had SSH connected with the esxupdate command ready to go, then shut down the VMs and put the host in maintenance mode. I applied the patch, restarted vmware-hostd, and then restarted the servers. I believe the downtime was under 10 minutes.

I just tested, and VMotion works to that server now from my unpatched servers, so I can put the others in Maintenance mode and patch under less pressure.

-Tony

RobertGreenlee
Contributor

I updated one of my 5 hosts and everything appears to be fine so far. As bad timing would have it, I upgraded the first 2 yesterday and this third one this morning, which pretty much left it in a broken state since I could not power anything on with it, nor could I VMotion anything over. I have disabled HA so that if I lose one of my U1 servers it does not try to start VMs on the broken U2 servers. I put my broken server into maint mode and patched it using Update Manager. It worked fine, although it takes a while for the post-update scan.

After taking it out of maint mode I fired up a bunch of test VMs and vmotioned one over to a U1 node and back to the patched node successfully. I'm going to let this box run overnight and if there are no bad reports in the morning from you folks on the other side of the world I'll update my other U2 hosts and reenable HA.

Thanks for all the hard work getting this patch out. Unfortunately I think you've gotten a serious black eye today. We were finally getting management happy with the idea of using ESX for production servers and this set us back a little bit.

Robert

jasonboche
Immortal

Letter from CEO Paul Maritz has just been posted:

http://blogs.vmware.com/console/2008/08/letter-from-vmw.html

Jason Boche

VMware Communities User Moderator (http://communities.vmware.com/docs/DOC-2444)

VCDX3 #34, VCDX4, VCDX5, VCAP4-DCA #14, VCAP4-DCD #35, VCAP5-DCD, VCPx4, vEXPERTx4, MCSEx3, MCSAx2, MCP, CCAx2, A+
admin
Immortal

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

I'd normally just lock this thread, but there are a lot of incoming links and a lot of people with email notifications on this one.

However, it's really unmanageable for most people, so please let's move the conversations to the above two threads.

kenner
Contributor

I'm having a serious problem with the patch. I patched one host and it was OK as I started machines on it, but then I see infinite loops of machine state changes from ON to RECONFIGURING and back to ON when I look at the latest hostd.log. They are coming from VirtualCenter Server, but I can't figure out why or how to stop it from happening. It's keeping hostd too busy to do anything else, so I can't move machines to that host.

HSpeirs
Enthusiast

Patch now available - 107 MB download

http://www.vmware.com/go/esxexpresspatches

H.

Mike_Glenn
Contributor

The difference between the word "licensing" and "time bomb" is semantics in this case....

Agreed. Simply put, the point is that in order to combat some envisioned piracy scenario, VMware installed into ESX a deliberate single point of failure (SPOF) whose as-designed failure mode is capable of turning your data center into a sea of chaos.

And the appropriate response to even the mere existence of such a potentially devastating logic bomb in what is supposed to be enterprise-class software can be summarized even more simply: UNACCEPTABLE.
