VMware Cloud Community
mattjk
Enthusiast

BIG bug in ESX 3.5 Update 2 - If you're using 3.5u2 read this now! - A general system error occurred: Internal Error

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

Hi all,

We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.

The VMware tech support person we spoke to wouldn't 100% confirm whether this was or would be affecting all ESX 3.5u2 installs, but he strongly hinted that it was widespread. For others' sake I hope I'm wrong and it's limited.

The bug:

Starting this morning, we could neither power on nor VMotion any of our virtual machines. The VI Client threw the error "A general system error occurred: Internal Error".

Further digging led us to messages like this one in /var/log/vmware/hostd.log, and in the log file for any virtual machine we tried to power on or VMotion:

Aug 12 10:40:10.792: vmx| This product has expired.

Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.

Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".
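A quick way to check whether a given log already contains the expiry message quoted above is a small grep wrapper. This is only a sketch: the function name has_expiry_error is made up for illustration, and it looks for nothing more than the literal message text shown in this post.

```shell
# Illustrative helper (not a VMware tool): succeeds if the given log file
# contains the "This product has expired" message shown above.
has_expiry_error() {
    grep -q 'This product has expired' "$1"
}
```

On the Service Console it could be used as: has_expiry_error /var/log/vmware/hostd.log && echo "expiry message found"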

A call to tech support confirmed this as a known problem with a temporary workaround.

The work-around:

Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.

As soon as the date was reset to the 10th - problem solved.

Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from a suspended state) and VMotion.

So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.
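The date boundary described above can be written down as a small check. This is a sketch under the assumption that the cutoff is exactly 12 August 2008, as reported in this post; is_past_expiry is a hypothetical name, and the date is passed as YYYYMMDD so that a plain integer comparison works.

```shell
# Hypothetical check: succeeds if the date ($1, in YYYYMMDD form, defaulting
# to the host's current date) is on or after the observed cutoff of 2008-08-12.
is_past_expiry() {
    [ "${1:-$(date +%Y%m%d)}" -ge 20080812 ]
}

# Example calls for the two days either side of the boundary.
is_past_expiry 20080811 || echo "11th: fine"
is_past_expiry 20080812 && echo "12th: affected"
```

Called with no argument, it tests the host's current date, which is a cheap way to confirm that a rolled-back clock is actually before the cutoff.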

There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.

Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!

Cheers,

Matt Kilham

Stratton Car Finance

Message was edited by: JohnTroyer to add new thread links.

LB1
Contributor

Our company also has compliance requirements, but luckily we can set our date back.

But here is the better way to do it in my opinion:

Set your date back by 1 or 2 years (to a year in which your ESX farm didn't exist).

This allows you to repair your logs easily using a search/replace function later.

Be advised however that MOST compliance related logging occurs on the VirtualCenter server.

DO NOT CHANGE THIS DATE!

VirtualCenter server dates do not have any relation to the problem.

Furthermore, MOST VC servers are members of Domains, and if you set this time back (even to the 10th) you will not be able to log on to the server, as Active Directory will fail on this server, and/or at the very least, the NTP client on the server will RESET the time back to correct.

Basically, there is no need to reset the VC time, but I recommend a rollback of a full year or two on the ESX hosts, for ease of log repair at a later time.
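The search/replace repair mentioned above could look something like the following sed wrapper. It is a rough sketch that assumes log lines begin with a bracketed ISO-style timestamp such as [2006-08-12 ...]; that format, the function name shift_log_year, and the example years are all assumptions for illustration, so check the real log format before using anything like it.

```shell
# Hypothetical helper: rewrite the year in a leading "[YYYY-" timestamp.
# Reads log lines on stdin; $1 is the rolled-back year, $2 the real year.
# Assumes (unverified) that each line starts with "[YYYY-MM-DD ...".
shift_log_year() {
    sed "s/^\[$1-/[$2-/"
}

# Example with a made-up log line.
printf '[2006-08-12 10:40:10] example line\n' | shift_log_year 2006 2008
```

This only touches the year, so it fits the case where the clock was moved back by a whole number of years and the month and day were left alone.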

maishsk
Expert

Forgive me for being the bad guy here.

But as an administrator, do you not wait for a product to stabilize before installing it into your production environment?

I for one will not install any new software, be it Microsoft, VMware, whatever, until it has been up and running in the wild for at least a month or two after the release date.

Same goes for patches, by the way.

Do not make the mistake of jumping on the bandwagon for new things on your production systems.

Feel sorry for all of you that were hasty here, but maybe this will be a lesson to us all...

Good luck with your nerves....


Maish

Systems Administrator & Virtualization Architect

Maish Saidel-Keesing • @maishsk • http://technodrone.blogspot.com • VMTN Moderator • vExpert • Co-author of VMware vSphere Design

frank_wegner
VMware Employee

Trial licenses will not help, because the licensing check itself is broken.

sradnidge
Enthusiast

<snip>

Be advised however that MOST compliance related logging occurs on the VirtualCenter server.

DO NOT CHANGE THIS DATE!

<snip>

And for anyone out there running VC in a VM...

Erik_Zandboer
Expert

I must say I do not agree with the statement that you should wait until a release stabilizes. No one could have seen this coming. When do you upgrade? When you have tested a release for three days, a week, two weeks? We should be thankful for people who upgrade that fast and uncover any problems with releases.

Unfortunately, today is somewhat a black day for all of those pioneers...

Visit my blog at http://www.vmdamentals.com

mimo17
Contributor

Hello maishsk

Good that you know better after the fact - you are brilliant.

We tested intensively and found no bugs. And as another post said, ESX is fine. It's the money-making machine (licensing) that broke.

If you are one of the best guys in the world, help find a solution instead of complaining about what someone did wrong in the past.

Michael

LudoS
Contributor

Hi Joerg,

I don't know how you designed your environment, but despite our highly redundant hardware (8 NIC, 8 SAN), UPS, and multipath everywhere, there are still a number of failures you cannot provide redundancy for (e.g. motherboard failure, CPU failure, the local RAID controller itself), plus failure of the ESX virtualization layer itself, as in this case.

To deal with this kind of issue, VMware designed HA/DRS/VMotion. But this is not enough to protect you fully. You have to improve global redundancy at the host level by keeping N+1 redundancy. If your environment is production-only and critical, you should be able to lose one server in the cluster and restart the VMs on the remaining ones.

I'm the first to recognize that such a bug is a shame for VMware but don't blame them for poor design.

If you want to be proactive, I suggest you find the ESX server hosting your least critical VMs (maybe one hosting only dev servers, or the one with the fewest VMs, or the one whose customers are the most tolerant) and stop all VMs on it (an orderly shutdown is always better than a crash), restart them elsewhere on 3.5U2, and downgrade the empty box to 3.5U1. You can then use VMotion to move VMs from another ESX 3.5U2 server (it works, I tested it) and perform a "rolling downgrade" of your infrastructure. Re-installing ESX should not take more than one hour, and you will be in a much better situation when VMware releases the fix, regardless of the form it takes: ISO, RPM ...

Hope this helps, best regards,

Ludovic

ab_lal
Enthusiast

We used the following workaround to power on the VMs.

Find the host where the VM is located.

Run 'vmware-cmd -l' to list the VMs.

Then issue the commands:

service ntpd stop

date -s 08/01/2008

vmware-cmd /vmfs/volumes/<vm path>/vmname.vmx start

service ntpd start
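The four commands above can be wrapped in a small function so the same sequence is repeated for each VM. This is a sketch only: the run helper, the DRY_RUN switch, the function name, and the example .vmx path are illustrative additions and not part of any VMware tooling. With DRY_RUN left at its default of 1, the script only prints what it would do.

```shell
# Sketch only. DRY_RUN=1 (the default here) prints commands instead of
# running them; set DRY_RUN=0 on the Service Console to execute for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 is the full path to the VM's .vmx file, as listed by 'vmware-cmd -l'.
power_on_with_rollback() {
    run service ntpd stop
    run date -s 08/01/2008      # any date before 12 August 2008
    run vmware-cmd "$1" start
    run service ntpd start
}

# Hypothetical path, for illustration only.
power_on_with_rollback /vmfs/volumes/datastore1/myvm/myvm.vmx
```

As in the manual steps, ntpd is restarted at the end; how quickly the host clock returns to the correct time afterwards depends on the NTP configuration.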

LB1
Contributor

I agree with you; however, I did not update to this "Update 2" level manually.

Apparently, major update releases are put in place automatically by Update Manager!

I was unaware that Update Manager would do this; I thought it only put out interim patches, not major OS updates!

We have (now HAD) Update Manager running the updates monthly.

Apparently this got put in the update sequence for installation and I didn't see it.

I have Change Control documents in the pipeline for U2, and had expected to install in phases.

(the right way)

But now, I'm apparently all updated (except for VirtualCenter and VCB).

LeoKurz2
Enthusiast

Just my 5 cents on timesync: the VM will pick up the ESX date and time during reboot. The BIOS clock of the VM is set by ESX. When you have turned on timesync in the VMware Tools, the clock of the VM will not be set back, but it will be slowed down until the ESX clock "catches up". So time in VMs is affected if you use the tools to sync time.

Cheers

__Leo

frank_wegner
VMware Employee

Actually, one of the VMware customers I work with suggested this enhancement request:

Remove the license server from VMware completely. Most customers are honest anyway, and anyone who wants to cheat can find ways to do so.

jbusink
Contributor

Same issue here, and temporarily resetting the date works. This truly is a major issue.

dab
Enthusiast

Can't see a patch, though. I'll be opening a support request too and asking for an ETA!

Update from the KB article:

An issue with ESX/ESXi 3.5 Update 2 causes the product license to expire on August 12, 2008. VMware engineering has isolated the root cause of this issue and will reissue the various upgrade media including the ESX 3.5 Update 2 ISO, ESXi 3.5 Update 2 ISO, ESX 3.5 Update 2 upgrade tar and zip files by noon, PST on August 13. These will be available from the page: http://www.vmware.com/download/vi. Until then, VMware advises against upgrading to ESX/ESXi 3.5 Update 2. The Update patch bundles will be released separately later in the week. This KB article will be updated as soon as more information is available, check back frequently for updates and additions.

Daniel

totgate
Contributor

""Hi FrancWest,

Everyone is mobilized here at VMware. mjlin, who posted in this thread several hours ago, is the product manager. Support knows what is going on. Someone else has posted our first communication here on this thread (patch should be available within 36 hours). Unfortunately I also can't access the kb, but I assume that posted message is from the kb.

I know we're preparing additional communication, so check that kb and expect more from us as we have more information. I'm sorry we weren't able to reach out to everyone directly yet.

John""

Unbelievable, you claim to be an enterprise solution. You really SUCK BIGTIME. 36 hrs is not acceptable; as a matter of fact, it's not acceptable that this happened at all. MS has always been the target for people to yell at when it comes to buggy software, but they have never managed to do something like this. My company is one of the major resellers of VMware here in Sweden, but I really think that's going to change now. You should really be ashamed of yourselves for letting something like this happen, and then you have the stomach to tell us, your customers and users, that we are supposed to wait 36 hrs for a fix to emerge.... You don't deserve to be in this business any more. You have cost thousands of people and companies a lot of money this time with your little stunt... SHAME ON YOU!!!!!!!!!

/Tobbe

Erik_Zandboer
Expert

Even worse, the 36-hr solution will be for ISO and upgrade TARs only... Patches are to arrive even later than that... Sometime this week :(

Visit my blog at http://www.vmdamentals.com

leeus
Contributor

Has someone tried setting the date forward to the 13th?

Does it "go away" tomorrow?

hughs
Contributor

I've just tried it. Still broken on the 13th.

krival96
Contributor

http://kb.vmware.com/kb/1006716

kris

www.vdi.co.nz

Bakafish_com
Contributor

A little bit of calm please. This was a major screw up, but 36 hours is a really short time to fix this particular issue. As I stated in a previous post, this fix is most likely going to touch many many places in the code-base. Do you really want them to compound the problem by rushing out a broken set of patches that could be an even bigger disaster? This bug doesn't lose data, there are some simple workarounds that will allow you to boot your VM's without too much impact.

Right now the engineers are having to create a new branch off of the 3.5u2 code-base, apply the fixes for this problem, compile the entire payload, pull all the QA people from every country that ESX is released in (all different time zones) off of what they were doing to make sure they didn't break something else and put everything up on the site (making sure to remove all the legacy broken stuff.) This is usually a 2-3 week process, 36 hours is better than nothing. And if you really think Microsoft will serve you better and more effectively, well maybe you should give their product a try and see how far that gets you. We'll be here when you're done with that...

dmanconi
Enthusiast

It would be nice if I could get to either the KB site or the article itself... All other websites are fine, just not that article or the kb.vmware.com site.
