VMware Cloud Community
mattjk
Enthusiast

BIG bug in ESX 3.5 Update 2 - If you're using 3.5u2 read this now! - A general system error occurred: Internal Error

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

Hi all,

We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.

The VMware tech support person we spoke to wouldn't 100% confirm whether this was or would be affecting all ESX 3.5u2 installs, but he strongly hinted that it was widespread. For others' sake I hope I'm wrong and it's limited.

The bug:

Starting this morning, we could neither power on nor VMotion any of our Virtual Machines. The VI Client threw the error "A general system error occurred: Internal Error".

Further digging led us to messages like this one in /var/log/vmware/hostd.log, and in the log file for any virtual machine we tried to power on or VMotion:

Aug 12 10:40:10.792: vmx| This product has expired.

Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.

Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".
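
For anyone who wants to check a pile of VM logs quickly, here is a minimal Python sketch that looks for the expiry message shown above. The line layout (timestamp, then `vmx|`, then the message) is assumed from the three sample lines only, the name `find_expiry_lines` is made up for illustration, and real hostd.log entries may well be formatted differently.

```python
import re

# Marker text copied from the log excerpt above.
EXPIRY_MARKER = "This product has expired."

# Assumed layout, based only on the sample lines: "<Mon DD HH:MM:SS.mmm>: vmx| <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}): vmx\| (?P<msg>.*)$"
)

def find_expiry_lines(lines):
    """Return (timestamp, message) pairs for vmx lines that report the expiry."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and EXPIRY_MARKER in match.group("msg"):
            hits.append((match.group("stamp"), match.group("msg")))
    return hits

sample = [
    "Aug 12 10:40:10.792: vmx| This product has expired.",
    "Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.",
]
print(find_expiry_lines(sample))
```

To scan a real log, pass an open file object instead of `sample`, since iterating over a file yields its lines.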

A call to tech support confirmed this as a known problem with a temporary workaround.

The work-around:

Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.

As soon as the date was reset to the 10th - problem solved.

Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from a suspended state) and VMotion.

So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.
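
The cutoff we saw in testing can be written down as a simple date comparison. This is only a sketch of the observed behavior under our own assumptions, not VMware's actual check: the name `power_on_blocked` is hypothetical, and treating every date from the 12th onward as blocked is an assumption, since we only tested the 10th, 11th and 12th.

```python
from datetime import date

# Cutoff inferred from the testing described above (10th and 11th fine, 12th fails).
# Illustrative only: this models what we observed, not VMware's code.
OBSERVED_CUTOFF = date(2008, 8, 12)

def power_on_blocked(host_date, cutoff=OBSERVED_CUTOFF):
    """Return True if a host clock set to host_date would hit the observed failure."""
    return host_date >= cutoff

for day in (10, 11, 12):
    d = date(2008, 8, day)
    print(d.isoformat(), "blocked" if power_on_blocked(d) else "ok")
```

This is also why the date rollback works as a stopgap: any host date before the cutoff lands on the "ok" side of the comparison.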

There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.

Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!

Cheers,

Matt Kilham

Stratton Car Finance

Message was edited by: JohnTroyer to add new thread links.

Cheers, Matt
704 Replies
ElmbrookDan
Enthusiast

Well, 6pm PST August 12th is better than 12pm August 13th.

COS
Expert

Just got this in email 2 minutes ago...

Dear VMware Customers,

Please find the latest update about the product expiration issue. From this point on, we'll provide an update every two hours. Thanks.

Problem:

An issue has been discovered by many VMware customers and partners with ESX/ESXi 3.5 Update 2 where Virtual Machines fail to power on or VMotion successfully. This problem began to occur on August 12, 2008 for customers that had upgraded to ESX 3.5 Update 2. The problem is caused by a build timeout that was mistakenly left enabled for the release build.

Affected Products:

  • VMware ESX 3.5 Update 2 & ESXi 3.5 Update 2

  • Reports of problems with ESX 3.5 U1 with the following 3.5 Update 2 patch applied.
    1. ESX350-200806201-UG

  • No other VMware products are affected.

What has been done?

  • Product and Web teams pulled the ESX 3.5 Update 2 bits from the download pages last night so no more customers will be able to download the broken build.

  • VMware Engineering teams have isolated the cause of the problem and are working around the clock to deliver updated builds and patches for impacted customers.

  • A Knowledgebase article has been published (http://kb.vmware.com/kb/1006716), but traffic to the knowledgebase is causing timeouts. A new static page has been published at http://www.vmware.com/support/esx35u2_supportalert.html that customers and partners will be able to view.

  • The phone system has been updated to advise customers of the problem.

  • VMware partners have been notified of the issue.

Workarounds:

  • 1. Do not install ESX 3.5 U2 if it has been downloaded from VMware's website or elsewhere prior to August 12, 2008.

  • 2. Set the host time to a date prior to August 12, 2008. This workaround has a number of very serious side effects that could impact production environments. Any Virtual Machines that sync time with the ESX host and serve time-sensitive applications would be broken. These include, but are not limited to, database servers, mail servers, & domain administration systems.

Next Steps:

VMware to notify customers who have downloaded this version and provide an update every two hours.

Resolution:

VMware Engineering has isolated the root cause and is working to produce an express patch for impacted customers today. The target timeframe is 6pm, August 12, 2008 PST.

FAQ:

  • 1. What would this express patch do?

    More information will be provided in subsequent communication updates.

  • 2. Will VMware still reissue the upgrade media and patch bundles in the timeframe that has been communicated?

    Yes. We still plan to reissue upgrade media by 6pm, August 13 PST (instead of noon, August 13 PST) and all update patch bundles later in the week. We will provide an ETA for the update patch bundles subsequently. NOTE: the "patch bundles" referred to here are for the patches listed above under "Affected Products" and the other bundles released at GA. They are not the same as the express patch which is targeted for 6pm, August 12, 2008 PST as stated above.

  • 3. Why does VMware plan to reissue the upgrade media before the patch bundles? That is a wrong priority call!

    This is not a matter of priority. Since we can get done building and testing the upgrade media before the patch bundles, we want to make that available to customers first instead of reissuing all the binaries later in the week.

  • 4. Can VMware issue a patch that opens the licensing backdoor in the next hour as a critical measure?

    There is no licensing backdoor in our code.

  • 5. Does this issue affect VC 2.5 Update 2?

    No.

What is VMware doing to make sure that the problem won't happen again?

We are making improvements on all fronts. The product team had endeavored to deliver a release with support customers deem important. But we fell short and we are deeply sorry about all the disruption and inconveniences we have caused. We have identified where the holes are and they will be addressed to restore customers' confidence.

Gabrie1
Commander
I agree that this happened as a side effect of Diane Greene leaving. With her gone they are more worried about pleasing the stakeholders than the actual customers so now all updates must go out on time no matter what, and this is what happens. I doubt it will happen, but I hope they learn from this.

You're not serious, are you? You really think that after Diane left, the new CEO put pressure on the teams to get U2 out as soon as possible? Made them skip all their normal quality procedures and just throw out U2 without proper testing? It takes quite some time before changes that a new CEO wants to make turn into really effective changes in the way a big company works.

http://www.GabesVirtualWorld.com
joe70353
Contributor

I have always found VMware's product licensing to be the worst kind of torture imaginable, but I never thought it would actually break my system.

For my department, I believe a permanent solution can be found here:

hjelmar
Contributor

"The problem is caused by a build timeout that was mistakenly left enabled for the release build." ....

Seriously??....That's new....

"I don't know why people hire architects and then tell them what to do."
COS
Expert

Luckily for us we were only "testing" U2, so no prod server was affected. I guess it's safe to say that VMware is "human". Remember, statistically everything we do as humans has 2% error in it.

lol...

rollin71
Contributor

I finally just got an email from VMware at 4:26 CST

EXTREMELY URGENT PROBLEM WITH ESX/ESXi 3.5 Update 2 - UPDATE

Ahh well, let's hope that the patch is out on time.

admin
Immortal

Dear VMware Customers,

Please find the latest update about the product expiration issue. What's new in this communication update compared to the last one is what we added in FAQ 1) that we are on track to deliver an express patch by 6pm, August 12, 2008 PST. Thanks.

Problem:

An issue has been discovered by many VMware customers and partners with ESX/ESXi 3.5 Update 2 where Virtual Machines fail to power on or VMotion successfully. This problem began to occur on August 12, 2008 for customers that had upgraded to ESX 3.5 Update 2. The problem is caused by a build timeout that was mistakenly left enabled for the release build.

Affected Products:

- VMware ESX 3.5 Update 2 & ESXi 3.5 Update 2.

- The problem will be seen if ESX350-200806201-UG is applied to a system.

- No other VMware products are affected.

What has been done?

- Product and Web teams pulled the ESX 3.5 Update 2 bits from the download pages last night so no more customers will be able to download the broken build.

- VMware Engineering teams have isolated the cause of the problem and are working around the clock to deliver updated builds and patches for impacted customers.

- A Knowledgebase article has been published (http://kb.vmware.com/kb/1006716), but traffic to the knowledgebase is causing time outs. A new static page has been published at http://www.vmware.com/support/esx35u2_supportalert.html that customers and partners will be able to view.

- The phone system has been updated to advise customers of the problem.

- VMware partners have been notified of the issue.

Workarounds:

1) Do not install ESX 3.5 U2 if it has been downloaded from VMware's website or elsewhere prior to August 12, 2008.

2) Set the host time to a date prior to August 12, 2008. This workaround has a number of very serious side effects that could impact production environments. Any Virtual Machines that sync time with the ESX host and serve time-sensitive applications would be broken. These include, but are not limited to, database servers, mail servers, & domain administration systems.

Next Steps:

VMware to notify customers who have downloaded this version and provide an update every two hours.

Resolution:

VMware Engineering has isolated the root cause and is working to produce an express patch for impacted customers today. The target timeframe is 6pm, August 12, 2008 PST.

FAQ:

1. What would this express patch do?

We are on track to deliver an express patch by 6pm August 12, 2008 PST. More information about the express patch will be provided in subsequent communication updates.

2. Will VMware still reissue the upgrade media and patch bundles in the timeframe that has been communicated?

Yes. We still plan to reissue upgrade media by 6pm, August 13 PST (instead of noon, August 13 PST) and all update patch bundles later in the week. We will provide an ETA for the update patch bundles subsequently. NOTE: the "patch bundles" referred to here are for the patches listed above under "Affected Products" and the other bundles released at GA. They are not the same as the express patch which is targeted for 6pm, August 12, 2008 PST as stated above.

3. Why does VMware plan to reissue the upgrade media before the patch bundles? That is a wrong priority call!

This is not a matter of priority. Since we can get done building and testing the upgrade media before the patch bundles, we want to make that available to customers first instead of reissuing all the binaries later in the week.

4. Can VMware issue a patch that opens the licensing backdoor in the next hour as a critical measure?

There is no licensing backdoor in our code.

5. Does this issue affect VC 2.5 Update 2?

No.

6. What is VMware doing to make sure that the problem won't happen again?

We are making improvements on all fronts. The product team had endeavored to deliver a release with support customers deem important. But we fell short and we are deeply sorry about all the disruption and inconveniences we have caused. We have identified where the holes are and they will be addressed to restore customers' confidence.

COS
Expert

I was told the fix/patch is going to be available for us on Aug 13, 2008. Of course I am in PST in California. This gives them more time for us.

jasonboche
Immortal

This is really neither here nor there, but wouldn't it be PDT?

Jason Boche

VMware Communities User Moderator (http://communities.vmware.com/docs/DOC-2444)

VCDX3 #34, VCDX4, VCDX5, VCAP4-DCA #14, VCAP4-DCD #35, VCAP5-DCD, VCPx4, vEXPERTx4, MCSEx3, MCSAx2, MCP, CCAx2, A+
COS
Expert

PST.....PDT....

Potato.....Potahtoe...

lol....

bbohlen
Contributor

1) How big is your virtual infrastructure and what advanced VMware features do you use?

1 HA/DRS cluster with 4 hosts, 1 HA/DR cluster with 3 hosts, both attached to NetApp SANs.

2) How many minutes of downtime did your production/test environment suffer?

So far at least 8 hours. We are in the process of doing I2V/P2V/V2V over WAN to move servers in a datacenter to these clusters, and I can't boot any of the new VMs to continue the migration process. And to avoid many more hours of watching Converter percentages slowly increment, I can't power back on the production machines.

3) Are all your VMs back up and running now?

No production VMs went down.

4) What solution have you employed? And any other relevant information.

We have decided not to turn back the clocks on our hosts, even though none of our guests are syncing with them. We will apply any official patch from VMware ASAP. We will also not apply any further patches until at least 6 months after they are released.

jhanekom
Virtuoso

Jason, they're already busy enough as it is with ONE time issue. Don't give them another! ;)

mloeffler
Contributor

Hello,

This email arrived in my inbox just now. But honestly, VMware, don't you think this comes at least 12 hours too late? We as VMware partners are responsible to our customers and have already informed them ourselves.

They would have cut our heads off if we had sent this kind of information 12 hours late!

Regards, Markus

jasonboche
Immortal

Jason, they're already busy enough as it is with ONE time issue. Don't give them another! ;)

Good point.

I checked. California is currently on PDT, as they do follow daylight saving time, just as we are on CDT here in Minnesota.
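
For anyone who wants to double-check, here is a rough Python sketch of that lookup. It hard-codes the US rule in force since 2007 (daylight saving time runs from the second Sunday in March to the first Sunday in November), works on calendar dates only, and ignores the 2 a.m. switch on the transition days. The function names are made up for illustration.

```python
from datetime import date, timedelta

def nth_sunday(year, month, n):
    """Date of the n-th Sunday (n starts at 1) of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, Sunday is 6
    days_to_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))

def us_pacific_abbreviation(day):
    """Return 'PDT' or 'PST' for a calendar date (US rule, 2007 onward).

    Date-granular only: the 2 a.m. switch on transition days is ignored.
    """
    dst_start = nth_sunday(day.year, 3, 2)   # second Sunday in March
    dst_end = nth_sunday(day.year, 11, 1)    # first Sunday in November
    return "PDT" if dst_start <= day < dst_end else "PST"

print(us_pacific_abbreviation(date(2008, 8, 12)))  # prints PDT
```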






Jason Boche

VMware Communities User Moderator (http://communities.vmware.com/docs/DOC-2444)

VCDX3 #34, VCDX4, VCDX5, VCAP4-DCA #14, VCAP4-DCD #35, VCAP5-DCD, VCPx4, vEXPERTx4, MCSEx3, MCSAx2, MCP, CCAx2, A+
GraphiteDak
Enthusiast

I finally had to unsubscribe.. Couldn't take much more of this.

Seems like most that are doing the serious griping have it pretty easy w/ less than 10 hosts and 50-60 vms.

I'll be up all night fixing 100+ hosts thank you... LOL

Wish me luck! :)

bluedrake
Contributor

Yeah, let's just hope that with the new patch you don't have to update VMware Tools on all the VMs, otherwise we'll be up all night.

aultl
Contributor

Since this is the thread we're all hanging out in, let's get a poll going on how big of a headache this actually was for you:

1) How big is your virtual infrastructure and what advanced VMware features do you use?

6 hosts - 60 vm's - DRS - VMotion- SAN

2) How many minutes of downtime did your production/test environment suffer?

0:00

3) Are all your VMs back up and running now?

Yes

4) What solution have you employed? And any other relevant information.

Waiting for VMware patch. All guestOS admins were told upon guest creation that they should sync to metal NTP server. Will use date regression if we need to reboot a VM.

COS
Expert

Prod has 6 Hosts (all still on Update 0 or 1)

Using HA & DRS

29 Guests.

Our test environment is DEAD. All VM's DEAD on Two Hosts. 11 DEAD Guests

We will just wait till a fix comes up. Turning the clock back is a garbage workaround. Don't do it. You'll probably break any working VMs, thus adding to the misery.

bluedrake
Contributor

1) How big is your virtual infrastructure and what advanced VMware features do you use?

4 hosts - 40 vm's - DRS - HA- SAN

2) How many minutes of downtime did your production/test environment suffer?

8+ hours

3) Are all your VMs back up and running now?

Yes

4) What solution have you employed? And any other relevant information.

Reinstalled all hosts; now on the hopefully stable version 3.5 Update 1, and everything is working fine now (knock on wood).
