mattjk
Enthusiast

BIG bug in ESX 3.5 Update 2 - If you're using 3.5u2 read this now! - A general system error occurred: Internal Error

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

Hi all,

We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.

The VMware tech support person we spoke to wouldn't 100% confirm whether this was / would be affecting all ESX 3.5u2 installs, but he strongly implied that it was widespread. For others' sake I hope I'm wrong and it's limited.

The bug:

Starting this morning, we could neither power on nor VMotion any of our Virtual Machines. The VI Client threw the error "A general system error occurred: Internal Error".

Further digging led us to messages like this one in /var/log/vmware/hostd.log, and in the log file for any virtual machine we tried to power on or VMotion:

Aug 12 10:40:10.792: vmx| This product has expired.

Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.

Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".
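
For anyone who wants to check a log quickly, here is a rough sketch that greps for the expiry message quoted above. It assumes a POSIX shell on the Service Console; the function name `has_expiry_error` is made up for illustration, and only the hostd.log path comes from this post.

```shell
#!/bin/sh
# has_expiry_error FILE
# Succeeds (exit 0) if FILE contains the expiry message quoted above.
# The function name is illustrative, not a VMware tool.
has_expiry_error() {
    grep -q "This product has expired" "$1"
}

# Example: check the host agent log mentioned in the post.
# The file will not exist on a non-ESX machine, so errors are silenced.
if has_expiry_error /var/log/vmware/hostd.log 2>/dev/null; then
    echo "expiry message found in hostd.log"
fi
```

The same function can be pointed at an individual VM's log file instead of hostd.log.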

A call to tech support confirmed this as a known problem with a temporary workaround.

The work-around:

Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.

As soon as the date was reset to the 10th - problem solved.
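
If you have several hosts to do, the two steps above can be sketched as a small script. This is only a sketch under assumptions: it assumes NTP on the Service Console is controlled with the usual Red Hat-style `service ntpd stop`, and the `run` / `apply_workaround` helpers and the `APPLY` variable are names invented here. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the workaround described above. By default it only prints
# the commands; set APPLY=1 to actually run them as root on the host.
# Assumption: NTP is managed with the Red Hat-style "service" command.
WORKAROUND_DATE="08/10/2008"   # date given in the post (10 August 2008)

# Run a command for real only when APPLY=1, otherwise just print it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

apply_workaround() {
    run service ntpd stop            # stop NTP so the clock is not corrected again
    run date -s "$WORKAROUND_DATE"   # set the host date back, as in the post
}

apply_workaround
```

Run it once as-is to see the two commands, then again with `APPLY=1` on a host where you actually want the clock changed.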

Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from suspended state) and VMotion.

So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.
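
The date boundary we observed can be written down as a tiny check. This is a minimal sketch based only on our own testing (10th fine, 11th fine, 12th broken); the function name `is_affected_date` and the YYYY-MM-DD input format are choices made here for illustration.

```shell
#!/bin/sh
# is_affected_date YYYY-MM-DD
# Succeeds if the date is on or after 12 August 2008, the first day the
# problem showed up in our testing. The function name is illustrative.
is_affected_date() {
    ymd=$(echo "$1" | sed 's/-//g')   # 2008-08-12 -> 20080812
    [ "$ymd" -ge 20080812 ]
}

for d in 2008-08-10 2008-08-11 2008-08-12; do
    if is_affected_date "$d"; then
        echo "$d: affected"
    else
        echo "$d: fine"
    fi
done
```

To test the current host clock, pass it `"$(date +%Y-%m-%d)"`.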

There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.

Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!

Cheers,

Matt Kilham

Stratton Car Finance

Message was edited by: JohnTroyer to add new thread links.

Cheers, Matt
704 Replies
LeoKurz2
Enthusiast

@sisi: Nope, no VMotion, and I don't think HA will be able to bring VMs up on a U2 ESX host after a failover. Just don't move until the patch is available :)

ESXDevil
Enthusiast

This problem would never have been noticed in testing environments.

Or do you change the date on your Test ESX servers?

froaderick
Contributor

Might as well add testing for license problems to your normal testing routine. You know management is going to insist after today.... :)

virtualesxer
Contributor

Yeah, like this kind of thing never happens with an MS product. Give me a break. You're ready to go with a Xen copy, a standalone virtualization platform because of this?

U2 was released on 7/25 and many of you threw it on production systems that quickly? That's your bad really. I usually wait at least three months before moving it into production farms and hammer away at it in the lab first.

That's just (or should be) standard deployment or upgrade policy no matter who or what products you are talking about.

Looks like someone left in a huge piece of auto-expiry beta/test code. That's like a surgeon accidentally leaving in a pair of scissors: it's incredibly stupid and all manner of embarrassing. End of story.

markdean
Enthusiast

"We did our testing, checked the forums and thought after going through our dev and test that we would upgrade a couple of the production servers, it just happened that the programmed end date for the licensing was 12th August not some other date in the future that may have caught a lot more other people!"

True, but in this case, if you have a longer timeline before moving code out to production, you didn't get hit with this. If the date was October or November, sure, even my customers would be impacted. But all I'm saying is I'm glad I have a longer window for updating.

Mark Dean VM Computing
STS
Enthusiast

So if you turn back time, how compliant does that make you for SOX and auditing purposes? I'm just concerned that we would be unable to turn back time because of these issues.

A13x
Hot Shot

Yeah, the logs are fine after I migrated it using VCB to VMware Server. I did a full repair and sync, and that sorted it, so I had the basics up.

gdragats
Contributor

I usually wait around 8 hours to make sure all replicas (naming contexts) are in good shape.

markdean
Enthusiast

I'm seeing it in my lab systems which were updated to U2 about a week after it was released so in this case, my slowness worked to my client's advantage...:-)

Mark Dean VM Computing
markdean
Enthusiast

"it's incredibly stupid and all manner of embarrassing. End of story."

It is that indeed.

Mark Dean VM Computing
KlinikenLB
Contributor

You're right. This issue should not be the reason to change to another VM-OS. This can happen to every software company.

But: you can wait as long as you want, three months, six months, one year. You will never be sure you won't hit the same trouble; it might happen the day after your update ...

swspjcd
Enthusiast

Wow. This is a pretty major blunder, although the workaround is fairly simple as long as you stop the OS on the virtual from coming up before changing the time.

My bigger concern is that the fix had better be installable without shutting down any virtuals. Normally we would just VMotion them to a different host, apply the maintenance, and VMotion them back, but since VMotion is also broken, it looks like downtime is going to be needed, which I'm sure for some of you is just not an option.

We've had VMware in house since v1.x and honestly this is the first major problem we've encountered. Sure, there have been glitches along the evolution of the product, but for the most part it's been a pretty smooth ride.

Previously I read somewhere that VMware was going to notify everyone who has downloaded U2 about this bug. I have U2 installed on 5 of our 10 ESX boxes, which I did after testing on our test ESX box. I have not received any notification warning me about this bug at all. Anyone else? Are they just waiting for others to get hit by the bug?! That, to me, is an even bigger problem.

hugop
Hot Shot

Good thing I'm on holiday this week! But even so, I got calls from customers asking what had happened. Not a good position to be in for VMware; this could easily have been avoided. The problem is that too many updates are being released too often at the moment. Only a few weeks ago we were deploying ESX 3.5, then Update 1 came along and then Update 2... Release after release to satisfy VMware Marketing for the new features... SVMotion, DPM, Live cloning etc... Good thing none of my customers are using Update 2 in production yet...

pb999
Contributor

I started installing this update last week, 3 hosts were fine (of course), upgraded two more hosts this morning, and got this fault. Never imagined it could be a bug.....

I am now in deep doodoo...

froaderick
Contributor

So far I've only had one client suffer from this. This was the only client that insisted on installing U2 against my advice. But agreed, there really isn't a way to test for this. Now that it's happened, there will be. :)

piglet
Enthusiast

Folks,

If we choose to stop ntpd and change the system date, we need to know whether the VMs on that host (and in general) are syncing their time to the host through VMware Tools.

The prospect of double-clicking on loads of VMware tools icons didn't really give me the Pot Noodle Horn, so I've taken the liberty of knocking up a cheeky bit of Powershell to achieve this.

So long as you have the Windows Powershell exe installed, and the VI toolkit (this worked fine on the Beta), the following should output a list of all VMs in VC with their Display Name, IP, Host (important due to host-centric changes to ntpd and system date), and current tools sync status.

# List every VM with its guest IP, guest hostname, and Tools time-sync setting
get-vm | % { get-view $_.ID } |
    Select-Object Name,
        @{ Name="IPaddress"; Expression={$_.Guest.ipaddress} },
        @{ Name="Hostname"; Expression={$_.Guest.hostName} },
        @{ Name="ToolsSyncTimeWithHost"; Expression={$_.Config.tools.SyncTimeWithHost} } |
    out-file -filepath c:\psoutput.txt

Please note that I make no warranties about effectiveness, blah, blah, but it works fine for me. Even if your company policy is not to tick the timesync box in VMware tools, there are bound to be a couple of VMs that squeak through the net.

Hope this helps a few people...

piglet

Masergy
Contributor

Hello everyone,

I just did a test in our environment and everything works fine. We are running 4 ESX 3.5u2 servers. The only difference is that I am still running the licensing server from our original installation, version 2.5.

Cheers

Francisco

rob_nance
Contributor

I'm sure this is a result of them going freeware, sucks for those of us who pay for it.

This is pretty massive, CNN worthy even. Luckily the only machine we rebooted is just a monitoring machine.

Tomorrow by noon is crazy talk, this needs 100 guys working on it to get it out within an hour. We had to do this update because of a licensing problem with OEM copies of ESXi, where you can't autostart your VMs upon server boot because it says you don't have the right license. It seems they have been having a lot of problems with licensing lately...

pldoolittle
Contributor

>> Not to be argumentative, but 36hrs is way unreasonable to make customers wait.
>> ESX is supposed to be an enterprise product. Enterprise products usually have 4hr
>> SLA's. No one expects vmware to fix, recompile, and distribute ESX patches in
>> under 4hrs...but there is a huge gap between 4hrs and 36.

+1. It's also not too much to expect that sales staff and SEs would start calling and emailing their enterprise customers who may be affected. Hoping that customers might check the community and might check the knowledgebase (which is down, BTW) is the kind of stuff you'd expect from a mom-n-pop software company.

VMware, if you want to be a big league player, you gotta play like you're in the big leagues.

markdean
Enthusiast

If memory serves me, wasn't there another problem last year with ESX or another product and it had to be fixed the next day?

I agree, you are never going to get bug-free software (there's always a "known issues" section), but I've found (and hey, based on the response, I'm one of the few who does this) that by waiting at least 2 to 3 months, all the while testing, you can get a handle on most gotchas, but certainly not everything. Had the date been further out, as has been said, then more would have been impacted, including me. It was just fortunate for me it was the 12th of August and not October.

Mark Dean VM Computing