VMware Cloud Community
mattjk
Enthusiast

BIG bug in ESX 3.5 Update 2 - If you're using 3.5u2 read this now! - A general system error occurred: Internal Error

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

Hi all,

We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.

The VMware tech support person we spoke to wouldn't 100% confirm whether this was, or would be, affecting all ESX 3.5u2 installs, but he strongly implied that it was widespread. For others' sake I hope I'm wrong and it's limited.

The bug:

Starting this morning, we could neither power on nor VMotion any of our Virtual Machines. The VI Client threw the error "A general system error occurred: Internal Error".

Further digging led us to messages like this one in /var/log/vmware/hostd.log and in the log file of any virtual machine we tried to power on or VMotion:

Aug 12 10:40:10.792: vmx| This product has expired.

Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.

Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".
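
For anyone wanting to check a pile of collected logs for the same symptom, here is a minimal sketch in Python. It only looks for the "This product has expired." text shown in the excerpt above; the function names are made up for illustration, and it is meant to run on a workstation against copies of hostd.log or the per-VM logs, not on the Service Console itself.

```python
from pathlib import Path

# Marker text taken from the log excerpt in this post.
EXPIRY_MARKER = "This product has expired."


def find_expiry_lines(log_text):
    """Return every log line that contains the expiry marker."""
    return [line for line in log_text.splitlines() if EXPIRY_MARKER in line]


def scan_logs(paths):
    """Map each readable log file to its matching lines (paths come from the caller)."""
    hits = {}
    for path in paths:
        log_file = Path(path)
        if not log_file.is_file():
            continue  # skip anything that is missing or not a regular file
        matches = find_expiry_lines(log_file.read_text(errors="replace"))
        if matches:
            hits[str(log_file)] = matches
    return hits


if __name__ == "__main__":
    sample = (
        "Aug 12 10:40:10.792: vmx| This product has expired.\n"
        "Aug 12 10:40:10.792: vmx| Be sure that your host machine's date "
        "and time are set correctly.\n"
    )
    for line in find_expiry_lines(sample):
        print(line)
```

A host whose logs come back with no matches may simply not have attempted a power-on or VMotion since the 12th, so an empty result is not proof that it is unaffected.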

A call to tech support confirmed this as a known problem with a temporary workaround.

The work-around:

Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.

As soon as the date was reset to the 10th - problem solved.
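
If you have more than a couple of hosts, it can help to write the workaround down as a per-host checklist before touching anything. The Python sketch below only builds and prints that checklist; it does not connect to or change any host. The `date -s` command is the one given above. The `service ntpd stop` line is an assumption about how NTP is run on the Service Console, and the host names are placeholders.

```python
# Known-good date from this post, in the format passed to `date -s`.
ROLLBACK_DATE = "08/10/2008"


def workaround_commands(stop_ntp=True):
    """Return the Service Console commands for one host, in the order to run them."""
    commands = []
    if stop_ntp:
        # Assumption: NTP runs as the `ntpd` service on the Service Console.
        commands.append("service ntpd stop")
    commands.append('date -s "%s"' % ROLLBACK_DATE)
    return commands


def plan(hosts, stop_ntp=True):
    """Pair each host name with its command list (nothing is executed)."""
    return [(host, workaround_commands(stop_ntp)) for host in hosts]


if __name__ == "__main__":
    for host, commands in plan(["esx01", "esx02"]):  # hypothetical host names
        print(host)
        for command in commands:
            print("  " + command)
```

NTP is stopped first because otherwise it would pull the clock straight back to the real date.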

Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from suspended state) and VMotion.

So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.

There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.

Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!

Cheers,

Matt Kilham

Stratton Car Finance

Message was edited by: JohnTroyer to add new thread links.

Cheers, Matt
704 Replies
ElmbrookDan
Enthusiast

This part specifically. "the VMs have to be vmotioned or powered off n powered on."

Why would you have to restart the VMs?

esiebert7625
Immortal

I think you'll see it shortly; they want to make sure it is tested before releasing it. It wouldn't do any good to release something that may introduce other problems. A few of us were contacted about 4 hours ago to see if we wanted to help test the patch. Hopefully it appears soon; I'm sure VMware will make an announcement in this thread as soon as it's ready.

Eric Siebert

VMware Communities User Moderator

-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-

Check out my website: VMware-land

Read my virtualization blog: SSV Blog

-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-

Tibmeister
Expert

I can confirm that doing the date change works as long as VMware Tools in the guest is not set to sync with the host. I set my host back to 8/12/2007 and VMotioned the VMs on to it, as this problem only affects the target host, not the source host.

Unless you have a "swing" host, this is the only feasible solution short of shutting all your VMs down.

admin
Immortal

Dear VMware Customers,

Please find the latest update about the product expiration issue. We are staging the express patches and expect staging to complete within an hour. When the staging is done, we will send out a communication with more details.

Please see FAQ 1) for details about these express patches. In FAQ 2), we describe what the upgrade media and update patch bundles to be released later are for. These are the updates since our last communication.

Complete information on the ESX/ESXi 3.5 Update 2 issue follows:

Problem:

An issue has been discovered by many VMware customers and partners with ESX/ESXi 3.5 Update 2 where Virtual Machines fail to power on or VMotion successfully. This problem began to occur on August 12, 2008 for customers that had upgraded to ESX 3.5 Update 2. The problem is caused by a build timeout that was mistakenly left enabled for the release build.
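
To illustrate what a build timeout of this kind amounts to, here is a minimal sketch in Python. It is not VMware's code. The function and flag names are hypothetical, and the only fact taken from this notice is the date on which failures began.

```python
from datetime import date

# Date taken from this notice: failures began on August 12, 2008.
EXPIRY_DATE = date(2008, 8, 12)


def may_power_on(today, timeout_enabled, expiry=EXPIRY_DATE):
    """Return True if a power-on or VMotion would be allowed on `today`."""
    if not timeout_enabled:
        # A release build is meant to skip the date check entirely.
        return True
    # With the timeout left enabled, the check fails from the expiry date onward.
    return today < expiry


if __name__ == "__main__":
    for day in (date(2008, 8, 10), date(2008, 8, 11), date(2008, 8, 12)):
        print(day.isoformat(), may_power_on(day, timeout_enabled=True))
```

Under this model, setting the host clock back to a date before the expiry makes the check pass again, which is consistent with the date workaround reported in this thread.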

The following message is displayed in the vmware.log file for the virtual machine:

This product has expired. Be sure that your host machine's date and time are set correctly.

There is a more recent version available at the VMware web site: http://www.vmware.com/info?id=4.

Module License Power on failed.

Affected Products:

  • VMware ESX 3.5 Update 2 and ESXi 3.5 Update 2.

  • The problem will be seen if ESX350-200806201-UG is applied to a system.

  • No other VMware products are affected.

What has been done?

  • VMware removed the ESX 3.5 Update 2 and ESXi 3.5 Update 2 binaries from the download pages in the evening of August 11, 2008 PST.

  • VMware Engineering teams have isolated the cause of the problem and are working around the clock to deliver updated builds and patches for impacted customers.

  • A Knowledgebase article has been published (http://kb.vmware.com/kb/1006716) and is being refreshed regularly.

Resolution:

VMware Engineering has produced express patches for impacted customers that will resolve the issue.

FAQ:

1. What will the express patches do?

There are two express patches: one for ESX 3.5 Update 2 and one for ESXi 3.5 Update 2. They are specifically targeted at customers who have installed or fully upgraded to ESX/ESXi 3.5 Update 2, or who have applied the ESX350-200806201-UG patch to ESX/ESXi 3.5 or ESX/ESXi 3.5 Update 1 hosts. Customers who have done neither should not apply these express patches.

Note that these patches have been validated to work with esxupdate. However, testing with VMware Update Manager is still under way. In subsequent communications, we will confirm whether the patches work with VMware Update Manager or whether a re-spin is required.

To apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards.

We are currently testing an option to apply the patch without requiring VMotion or a VM power-off and power-on at the point of patch application. To immediately refresh vmx on the VM, one can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs can be powered off before the patches are applied and powered back on afterwards.

2. When will VMware reissue the upgrade media and patch bundles?

VMware plans to reissue upgrade media by 6pm, August 13 PST and all update patch bundles later in the week. We will provide an ETA for the update patch bundles subsequently.

NOTE:

  • "Upgrade media" refers to the ESX 3.5 Update 2 ISO, the ESXi 3.5 Update 2 ISO, and the ESX 3.5 Update 2 upgrade tar and zip files. They are for customers who haven't installed or upgraded to ESX/ESXi 3.5 Update 2 but wish to.

  • The "patch bundles" here refer to those released at GA. They are for customers who do not wish to do a full upgrade to ESX/ESXi 3.5 Update 2, but wish to apply patches that are deemed necessary to hosts running ESX/ESXi 3.5 or ESX/ESXi 3.5 Update 1. They are not the same as the express patches described above.

3. Why does VMware plan to reissue the upgrade media before the patch bundles?

Since we can complete building and testing of the upgrade media before the patch bundles, we want to make that available to customers right away instead of reissuing all the binaries later in the week.

4. Can VMware issue a patch that opens the licensing backdoor in the next hour as a critical measure?

There is no licensing backdoor in our code.

5. Does this issue affect VC 2.5 Update 2?

No.

DLeeSFI
Contributor

It is too late to help me. My downtime window is shot. So now my vacation starting tomorrow is shot with it.

dj1
Contributor

I don't see why everyone is so surprised. Agreed, the issue is pretty bad, but as with any software you will occasionally find bugs (some small, some big 😛 ouch!), because ultimately people write the software and people test it or build the regression test routines/scripts for it. To expect software to always be 100% reliable from version to version or patch to patch is a bit of an unrealistic stretch.

The most interesting fallout from this, from my perspective, is not so much the bug but the trade-off an organisation/corporation/etc... needs to accept: standardising on a single platform for the sake of increased operational efficiency VERSUS the risk such a model carries when all your services are bundled in the same basket.

A case in point is Gmail, whose services also went down this week due to a bug; check out: http://gmailblog.blogspot.com/2008/08/we-feel-your-pain-and-were-sorry.html

Not sure there is an easy fix, because chances are future disruptions to various degrees will happen again and will just need to be managed via plan A, plan B, plan C, etc...

Otherwise VMware still rules. They'll learn a lot from this, and I'm pretty sure it'll be a long time before another bug of even half the magnitude appears. In our environment, we just rolled the servers we were trialling Update 2 on back to ESXi Update 1, and all is well.

ejward
Expert

This part specifically. "the VMs have to be vmotioned or powered off n powered on."

Why would you have to restart the VMs?

I think it's an either/or thing. If you can't VMotion, you'd have to power off the VMs. But just the fact that you have to do either tells me the host requires a reboot.

ElmbrookDan
Enthusiast

Thanks, it makes sense now, the way the official release states it.

finebanana
Contributor

We just bought a brand new server, delivered to us last week, and I decided to try out VMware ESXi 3.5u2 on it. I have never used any VMware hypervisor product before.

After running it for a couple of days, I powered off the host on Aug 12 since I wanted to move it to our rack, and when the host came up again, none of the VMs could be started. I panicked for a while until I googled the error and found a blog with the workaround.

Thank heaven this is not (yet) our production server! How can an enterprise-grade product suffer from a show-stopping bug of this scale?

awbc-au
Contributor

RParker, you misinterpret what I meant...

I'm saying that the bug isn't with ESX itself, but with the fact that ESX can't get a valid license (it thinks it's expired)... this is why it stops working... fix the license issue (yes, it's in code, I get it) and ESX works again...

In my opinion licensing shouldn't stop the core function of booting a server... it's so dangerous in general to have that in place... why not do it like MS and cripple it or something... maybe run the VM at half speed or restrict some features, but stopping the VM from booting (which is the entire point of the software) is crazy...

We are in this position because VMware wants to protect its software from pirates, not because of a fault with the product. Do it a better way... protect the customers who actually pay a great deal of money for it... ESX is enterprise software, not some graphics program or document editor that is peddled around on torrent sites... you need specialised hardware to run it, which is not in the domain of the general pirate trying to fleece your software... this fact alone means most companies will license it... you're not going to spend 50k plus on hardware only to rip off your copy of VMware, are you?

If this happened on Photoshop or something it would be far less critical... the fact that a license issue stops an entire production network that generates millions of dollars for some companies is just outrageous... for the most part people can put the workaround in as described and it's not a big issue, but for other companies like ecommerce / banks / lottery etc. this is not an option... there are laws to stop tampering with dates on systems, and this is a sector this enterprise product is aimed at, and one that VMware even uses to promote the reliability of its products on its website...

Remove the two-bit licensing model from it and do it another way... it's hard to tell management you can't boot the servers that make the money because of a license expiry issue when you're all paid up... the fact that the "shareware" version of the code was implemented in the update itself is laughable...

esiebert7625
Immortal

It's not really a bug per se; it's more of a procedural issue that was missed as part of the transition of the beta software to the final build. Obviously the time bomb code should have been disabled or removed prior to the final build and it wasn't. I think the fallout from this event will guarantee that changes are made to make sure that something like this never happens again.

Eric Siebert

finebanana
Contributor

Hi awbc-au,

I second your opinion on this, so to describe it in not so many words: IT SUCKS.

akmolloy
Enthusiast

Some constructive criticism for VMware:

Communication: I found out about this issue by reading Slashdot today. 2 hours later, I finally got an e-mail from VMware about it. Please consider dedicating more resources towards communication with your customers for critical issues.

VMware Tools management: Please consider building in centralized management of VMware Tools settings from Virtual Center. To prepare for this patch, I needed to make sure none of our VMs were syncing time with the hosts. Since most of our VMs are managed by others in my department, I had to contact all the sysadmins and ask them to change the VMware Tools settings. I can only imagine the pain felt by big installations that needed to change lots of VMware Tools settings.

Tibmeister
Expert

Everyone is getting this confused with a licensing problem, which it is not. It is a beta-build time bomb, which companies typically put in to prevent their beta builds from running after a certain date. This practice has been in place for years and is usually commented very plainly. This is a case where that time bomb wasn't removed for whatever reason, and instead of just shutting down the hypervisor, it crippled it by not allowing VMotion to the target. We know this because VMotion from a box still works, but what good is that if the host cannot accept the guest? Since the guest VM is on a host that is managed by VC, you must have the proper licensing, and an expired product will not take the proper licensing. If you remove the host from VC, I would bet that the VMs will start up no problem with the host in stand-alone mode.

So it is very easy and logical from a programmer's standpoint to see how this happened, and yes, QA could have caught it and should have, but to err is human. Also, it isn't stopping the VMs from running; it's just making the host act like a bad GSX server (I did have one that the VMs would not restart on properly), at least in my humble opinion.

Also, someone mentioned earlier that their manager was breathing down their neck. One thing I've learned is that even a tech-savvy manager may not know the core of the technology and reacts based on your reaction, so if you are calm and explain it rationally, the manager will be calm. I understand this is a freaky situation; I definitely got anxious this morning, but once I grabbed the trusty dry-erase markers, got the troops together and did some brainstorming, I realized that it is not a show stopper, just a major inconvenience. I can bring up a VM host from a cold boot; I just have to take a couple of extra steps to do it. Yes, the time will be off on the VM, but then you just log in as a local administrator and reset the time, no biggie. That takes an extra 5 minutes of your time, and with planning, hopefully nobody is shutting VMs down and needing to do this.

I personally am watching my Virtual Center like a hawk, but sometimes you just have to roll up the sleeves, forget about 9 to 5, and get a little dirty.

rmumford
Contributor

Over 500 VMs here and not in good spirits :(

awbc-au
Contributor

The difference between the words "licensing" and "time bomb" is semantics in this case... even the VMware logs use the term licensing in the error message they output when the VM can't be started...

I agree that technically the issue is caused by the time bomb, which revokes the "license", which causes the guests not to boot... the problem is with the licensing code; whether it is a legitimately expired license or a time bomb, it's the same thing and the result is the same... if you forget to renew your license the same problem occurs, and if your VC host goes down for a number of days (can't remember exactly how many) and the host can't contact it, this will occur as well... VMware won't let you start your guests until you re-license...

Agreed, the issue is an inconvenience as workarounds can be put in place... I can set the time back, rebuild the ESX server or even try to uninstall the U2 patch (as someone else mentioned), and all of these things will allow me to get the servers back online... The issue with all of these things is either technical (i.e. you can't set the time back for other reasons) or resourcing (i.e. you need to have someone perform those tasks, and I doubt an ESX rebuild can be done in 30 mins as someone commented...). By the time you reattach all the LUNs, set up agents, enable SSH etc., this all takes time and needs to be paid for... especially for companies that don't have ESX skills in house and outsource... Is VMware going to cover those costs?

Everyone is different... for some people it's not a big deal and they have commented as such with all the "don't get too worked up" posts... for other people it is a major problem and is causing downtime that in some cases they can't control... either way everyone is awaiting the patch and communication so we can get back to a reliable base again... personally we have very complex servers with scripts and all sorts of stuff that will take a long time to reconfigure... we also have a lot of them, and to do it on all of them would take days... It's a real shame that the benefits for VMware of having a common code library between the ESX and ESXi products mean a time bomb meant for the free version slipped into the production ESX codebase and caused this error. That's probably the biggest issue for VMware to deal with, hence my comment that a change in the licensing restrictions needs to be put in place, especially for the flagship ESX server, which is the product that generates the coin to be able to build the free ones in the first place...

Tibmeister
Expert

Definitely understand that, rmumford. Please don't take my post to mean that you shouldn't be upset; it just seemed like emotions were getting in the way of some people seeing how to recover from this. Now that I have a confirmed way of getting around this, I got a chuckle out of knowing that the folks at VMware are not super-human, just human. I used to do a ton of programming and had this happen to me once before, so it's kinda funny to see it happen to a major company. Although it is very annoying, it is a real possibility when dealing with technology.

rollin71
Contributor

We chose our jobs knowing that times like this always pop up, but never at a convenient time.

At least I didn't have any vacation scheduled, although I might take some after this.

sam_god
Contributor

Helpful Tip: As the load on cluster members changes, DRS tries to VMotion VMs across hosts. With the current problem, VMotion is failing with "Operation Timeout" error messages. To avoid this, select your DRS cluster and set the VMware DRS Automation Level to MANUAL. This will stop unnecessary failing VMotions.

After that, access the VMs on the host through the console, open VMware Tools, and remove the time sync with the host. For Windows machines, set the time server to an external time server in Internet Time; for Linux, use ntpd. Once all VMs have stopped updating their time from ESX, change the time of the ESX server to 01/AUG/2008. Now we should be able to evacuate one host (others may get overloaded). Then we will be able to apply the patch on the evacuated host and remediate it. Set the appropriate time on the remediated host or set it to use a time server. Now, instead of moving the original VMs back from the overloaded host, move VMs from another affected host to the remediated host. This ensures that no change to the VMware Tools time sync setting is required for those VMs, as the destination ESX host time is correct.

We are still waiting for the patch at this moment, so we are unable to test the last part of this procedure. But if you can migrate VMs to ESX 3.5 U1, then it should work with U2 also. I do not have any servers older than U2, so can anybody test a migration of a test VM from an affected system to an ESX 3.5 U1 host?

If the migrations are successful we will be able to apply express patch on all the hosts one by one this way and remediate the systems quickly.
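
The host-by-host order described above can be sketched as a small planning model in Python. This is only a sketch of the sequence in this post, which is itself untested until the patch arrives; it does not talk to Virtual Center, and the function name, host names and VM names are hypothetical.

```python
def rolling_remediation(hosts):
    """Plan the patch order for a cluster.

    `hosts` maps host name -> list of VM names. The first host is emptied onto
    the others; after that, each unpatched host drains into the host that was
    patched most recently, so VMs always land on a host with the correct time.
    Returns (steps, final_placement). Nothing is executed.
    """
    if len(hosts) < 2:
        raise ValueError("need at least two hosts to evacuate one by VMotion")

    placement = {name: list(vms) for name, vms in hosts.items()}
    names = list(placement)
    steps = []

    # Empty the first host by spreading its VMs round-robin over the others.
    first, rest = names[0], names[1:]
    for index, vm in enumerate(placement[first]):
        target = rest[index % len(rest)]
        placement[target].append(vm)
        steps.append("vmotion %s: %s -> %s" % (vm, first, target))
    placement[first] = []
    steps.append("patch %s and restore its clock" % first)

    # Each remaining host drains into the most recently patched (empty) host.
    patched = first
    for name in rest:
        for vm in placement[name]:
            placement[patched].append(vm)
            steps.append("vmotion %s: %s -> %s" % (vm, name, patched))
        placement[name] = []
        steps.append("patch %s and restore its clock" % name)
        patched = name

    return steps, placement


if __name__ == "__main__":
    cluster = {"esx01": ["vm-a"], "esx02": ["vm-b"], "esx03": ["vm-c"]}
    plan, final = rolling_remediation(cluster)
    for step in plan:
        print(step)
    print(final)
```

The last host patched ends up empty, so the cluster would still need rebalancing afterwards, for example by turning DRS automation back on.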

mattjk
Enthusiast

Patch is out:

Special Notice: Please Read

An issue has been uncovered with ESX/ESXi 3.5 Update 2 that causes the product license to expire on August 12, 2008. Follow the steps below to correct this issue:

1. Read the following Knowledge Base articles first:

* Fix of virtual machine power on failure issue, refer to KB 1006716

* For VI 3.5, refer to KB 1006721 for deployment consideration and instruction

* For VI3.5i, refer to KB 1006670 for deployment consideration and instruction

2. Download and apply the patch according to the product(s) you have:

VMware ESXi 3.5 Update 2 Patch | VMware ESX 3.5 Update 2 Patch

Cheers,

Matt Kilham

Stratton Finance

Cheers, Matt