Re: BIG bug in ESX 3.5 Update 2 - If you're using ... - Page 14

mattjk · ‎08-11-2008

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

Hi all,

We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.

The VMware tech support person we spoke to wouldn't 100% confirm whether this was / would be affecting all ESX3.5u2 installs, but he strongly implied that it was widespread. For others' sake I hope I'm wrong and it's limited.

The bug:

Starting this morning, we could neither power on nor VMotion any of our Virtual Machines. The VI Client threw the error "A general system error occurred: Internal Error".

Further digging led us to messages like this one in /var/log/vmware/hostd.log, and in the log file for any virtual machine we tried to power on or VMotion:

Aug 12 10:40:10.792: vmx| This product has expired.

Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.

Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".
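If you have a pile of log files to check for the same signature, a rough Python sketch along these lines might save some eyeballing. It assumes the line layout shown in the excerpt above (timestamp, then `vmx|`, then the message); the function name `find_expiry_lines` and the sample text are made up for illustration and are not anything VMware ships.

```python
import re

# Matches lines like: "Aug 12 10:40:10.792: vmx| This product has expired."
# The pattern is inferred from the log lines quoted above (an assumption,
# not a documented VMware log format).
EXPIRY_PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"vmx\| This product has expired\.\s*$"
)


def find_expiry_lines(log_text):
    """Return the timestamps of every 'This product has expired.' line."""
    stamps = []
    for line in log_text.splitlines():
        match = EXPIRY_PATTERN.match(line.strip())
        if match:
            stamps.append(match.group("stamp"))
    return stamps


if __name__ == "__main__":
    # Sample text copied from the excerpt above.
    sample = (
        "Aug 12 10:40:10.792: vmx| This product has expired.\n"
        "Aug 12 10:40:10.792: vmx| Be sure that your host machine's "
        "date and time are set correctly.\n"
    )
    for stamp in find_expiry_lines(sample):
        print("expired message logged at", stamp)
```

Feed it the contents of hostd.log or a VM's own log; an empty result only means this exact message was not found, not that the host is unaffected.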

A call to tech support confirmed this as a known problem with a temporary workaround.

The work-around:

Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.

As soon as the date was reset to the 10th - problem solved.

Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from suspended state) and VMotion.

So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.
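To put the observation in concrete terms, here is a tiny Python sketch of the behaviour as we saw it. The cutoff date comes straight from our testing described above; whether it is the exact cutoff on every build and in every time zone is an assumption, and `power_on_blocked` is just an illustrative name.

```python
from datetime import date

# First date on which the failure was observed (from the testing described
# above). Treating it as a hard cutoff everywhere is an assumption.
TRIGGER_DATE = date(2008, 8, 12)


def power_on_blocked(host_date):
    """Return True if, per the observation above, power-on/VMotion would fail."""
    return host_date >= TRIGGER_DATE


if __name__ == "__main__":
    for day in (10, 11, 12):
        d = date(2008, 8, day)
        print(d.isoformat(), "blocked" if power_on_blocked(d) else "ok")
```

This only models what the host clock reads, which is why setting the date back works as a stopgap.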

There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.

Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!

Cheers,

Matt Kilham

Stratton Car Finance

Message was edited by: JohnTroyer to add new thread links.


razablayde · ‎08-12-2008

Granted, I haven't been in the business since MS-DOS 1 like some of these "experts", but I haven't installed one from MS that stops production servers from booting. Programs are always going to have bugs since human error is unavoidable, but this problem seems like it should have been prevented or corrected very easily. It's as if VMware deployed a product that was timebombed as if it were a trial. That's pretty embarrassing and I would bet that some QA managers are out of jobs today. I've always been a huge backer of VMware but this is a huge black eye for a company that is looking to keep its hold on the market. Citrix Xen and Microsoft have to be licking their chops upon hearing about this fiasco.

My sympathies go out to all the admins and IT personnel that have production servers down this morning. Undoubtedly the confidence level in VMware has taken a serious hit.

jamieorth · ‎08-12-2008

I say bring back Diane - she would never let this happen...... (or maybe she put some code in there that only she could fix....) hmm, the plot thickens.......

Regards...

Jamie

If you found this information useful, please consider awarding points for "Correct" or "Helpful".

Remember, if it's not one thing, it's your mother...

jasonboche · ‎08-12-2008

I was curious if the instructor-led labs for the VI courses were on 3.5u2. If so they'd be having a non-productive week. I just talked to one of my co-workers who took the VI training last week and he doesn't think the VI racks were on u2 yet. Good news for the instructors.

Jas

Jason Boche

VMware Communities User Moderator (http://communities.vmware.com/docs/DOC-2444)

VCDX3 #34, VCDX4, VCDX5, VCAP4-DCA #14, VCAP4-DCD #35, VCAP5-DCD, VCPx4, vEXPERTx4, MCSEx3, MCSAx2, MCP, CCAx2, A+

bommart · ‎08-12-2008

I'll wash the blood stain off my forehead now. Thank GOD I didn't bring down any of the vital servers.... only 3 hours wasted.

Can I bill VMware????

dbuchanan · ‎08-12-2008

All,

Although my production environment has not been updated to 3.5 u2, as we do testing for 2 - 3 weeks prior to implementation, I have to agree with those that have voiced their dissatisfaction with VMware and its handling of this issue. By now this will have hit the mainstream news outlets and will be a major slap in the face to VMware and its stockholders. A 36-hour turnaround for a fix to a bug that has affected so many major businesses is not acceptable. I work in the banking industry and, from a compliance and audit perspective, have issues with playing with the time or clock settings on production systems. Also, 36 hours to fix a critical problem is unacceptable in almost all businesses.

VMware markets its products on how well they can assist in a disaster and provide quick and effective failover via HA/DRS/VMotion. Any company that experiences a disaster between now and the time it takes to provide this fix will be at risk, as VMotion/HA/DRS is now not functioning.

Actually, they are at risk now. So what is VMware's stance on that? ...........Silence.

jasonboche · ‎08-12-2008

"I say bring back Diane - she would never let this happen...... (or maybe she put some code in there that only she could fix....) hmm, the plot thickens......."

Since you brought it up, for the record, it happened twice this year under Diane's reign.


markdean · ‎08-12-2008

"from MS that stops production servers from booting."

There's a known issue on Hyper-V hosts (I don't run Hyper-V but I've read others who have had this problem) that if you have AV scan some part of where the VMs reside or where the xml files are, the host loses its knowledge of what VMs are running and you have to rebuild the files or whatever. Point is, it happens. And as far as MS specifically, true, aside from memory leaks that take time, usually if a patch is going to blow up the server, it does it right then in front of your eyes. That'll stop production systems from booting for sure and most admins who have worked with MS NT products have horror stories about it.

I guess VMware should have screwed up the other way, make it so you don't even need a license and everyone has Enterprise version, kind of like MS Office trial a few years ago that never timed out.

Mark Dean VM Computing

ML-EMP · ‎08-12-2008

Well, to summarize the issues here (quite a small cluster with 20 VMs and 3 hosts, all on U2 since last Friday... should I have known...).

- Discovered the issue

- Applied the initial workaround back to 10th of August

- Discovered the time sync setting WAS NOT respected by the #@"!ing Tools, so some VMs took on the rolled-back date as well

- Alarm went off at a branch site, as the 10th of August is a Sunday and no one was expected to be on site. Fortunately the supervision company told us the reports were coming in dated the 10th of August, so we avoided paying for an intervention.

- Back to the 12th with NTP.

- And now? Please?...

I'm not blaming the code guys. Everyone makes mistakes. But this is something that deserved, at the managerial level, a front-page notice + press information + direct communication to all ESX 3.x license managers. Customer confidence ALSO comes through transparency.

So well, crossing fingers for now... I have a 350-user Exchange box on the cluster + lots of SQL & various stuff, so I would not like one of the hosts or VMs to crash...

Cheers to all the community,

--

ML


hicksj · ‎08-12-2008

"But all I'm saying is I'm glad I have a longer window for updating."

Darn straight. I'd also like to thank all you 'bleeding edgers' out there for finding this. After taking 6 months to test ESX3 over the course of Beta/GA, I determined it was best that we allow at least a month prior to future upgrades. Looks like anything >2 weeks is a good start.

That said, an "Update" version should be just that, simple updates. It's unfortunate this occurred. This, coupled with the issues many folks have had with vc3.5u2, cannot instill a lot of confidence in future updates for the masses.

We just logged a completely non-related SR... given the volume of this thread, I don't think we'll be expecting a response to our problem anytime soon. 😛

THP · ‎08-12-2008

@ markdean - In fairness posts in this thread are quite often mentioning that U2 sort of stealthed itself into patches by not clearly marking it when remediating for other patches - we'd never have knowingly gone to U2.

For my money that is another issue that should be addressed after the main one has been fixed along with how a bugged version of the VI Client made it out too.

Too many changes in too short a time trying to satisfy EMC/VMware's commercial needs rather than customers' technical needs - there is no point in having bleeding edge features if you can't reliably run a product in a data centre.

In 3 years of running VI2.5 we've had exactly zero problems of this nature & severity. Perhaps a return to old methodologies is called for here?

mwheeler1982 · ‎08-12-2008

Yup.. I upgraded my 4 production hosts yesterday

You're right. This issue should not be the reason to change to another VM-OS. This can happen to every software company.

But: you can wait as long as you want, three months, six months, one year. You will never be sure you won't get the same trouble; it might happen on the day after your update ...

bjmoore · ‎08-12-2008

Meh, Citrix and Microsoft's products are still at least a revision behind where VMware is. Besides, comments like that come from environments with 20 VMs and 2 ESX hosts...not thousands of VMs and hundreds of hosts.

Like the surgeon comment, definitely the most accurate analogy yet

LeoKurz2 · ‎08-12-2008

I think the major problem is that there is no quick way to roll back patches on ESX (apart from having an installation base in the network and the up-to-date installation scripts for every machine sitting there...).

__Leo

JCS725 · ‎08-12-2008

I'm seeing this issue on two of my servers. Luckily I'm only running two 3.5U2 servers and there are no guests on them yet, but they are showing up as unlicensed in the console.

Looking forward to getting the permanent solution.

astrolab · ‎08-12-2008

It's important to reiterate that it's not Update2, rather those 2 patches referenced by abrjgl (thanks).

Michelle_Laveri · ‎08-12-2008

"I was curious if the instructor-led labs for the VI courses were on 3.5u2. If so they'd be having a non-productive week. I just talked to one of my co-workers who took the VI training last week and he doesn't think the VI racks were on u2 yet. Good news for the instructors." - Jason Boche

If they did - they would only have themselves to blame - as no course is yet certified to run on U2 at the moment. So no course should have been affected.

My own lab environment at home was, though. I decided last week to move over to U2 on ESX3i, mainly to test my SRM book against ESX3i, and because I had my first customer who had ESX 3i only last week...

It's taken me all day to get back to ESX 3.5 U1... and that's with scripted installations...

Regards

Mike

Regards
Michelle Laverick
@m_laverick
http://www.michellelaverick.com

admin · ‎08-12-2008

But if you go to the VMware Patch download page and select 3.5 and click on search, you're brought straight to the ESX 3.5 patch page, where it very clearly says:

"ESX350-Update01 (ESX350-Update02) is a roll-up bundle with no binary content. It is used to install and verify all bundles in the ESX Server 3.5 Update 1 (Update 2) release. If ESX350-Update01 (ESX350-Update02) is installed using VMware Update Manager, or using an esxupdate depot, the installer will check that all Update 1 (Update 2) bundles are present, install them, and verify that the host has been brought up to the ESX Server 3.5 Update 1 (Update 2) release level. For installation using the esxupdate utility, ensure all patch bundles listed in the "Requires" column are downloaded to the esxupdate depot. For installation using Update Manager, ensure the "ESX350-Update01" ("ESX350-Update02") bundle is added to the baseline for remediation."

So every time you apply a patch, you can change the build number. And if you patch your hosts with everything, then they get incremented to Update 2. The Update 2 patch itself is only a container or a logical checker that the required patches are installed to bring you to the Update 2 build version. I guess what I'm saying is, if you implement Update Manager, you should make sure you read through what it's doing fully!
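To picture the "logical checker" idea, here is a minimal Python sketch of a roll-up that carries no binaries and only verifies that a list of required bundles is present. It is an illustration of the concept as quoted above, not how esxupdate or Update Manager is actually implemented; the function names and the bundle IDs are placeholders.

```python
def missing_bundles(required, installed):
    """Return the required bundle IDs that are not yet installed, sorted."""
    return sorted(set(required) - set(installed))


def at_release_level(required, installed):
    """True when every bundle the roll-up lists is present on the host."""
    return not missing_bundles(required, installed)


if __name__ == "__main__":
    # Placeholder IDs for illustration, not real VMware bundle names.
    required = ["bundle-a", "bundle-b", "bundle-c"]
    installed = ["bundle-a", "bundle-c"]
    print("missing:", missing_bundles(required, installed))
    print("at release level:", at_release_level(required, installed))
```

The point being that a host can reach the Update 2 level one patch at a time, without anyone ever installing something labelled "Update 2".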

razablayde · ‎08-12-2008

Sure but VMware is going to have to fend off companies who offer viable alternatives and have the financial backing to compete with them. If they weren't concerned with it, they wouldn't have rushed the release of 3.5u2 to compete with Hyper-v and its price.

Also, what does the number of guests/hosts have to do with anything? Business critical is business critical no matter the number of deployed VMs. From what I've read, people with small to large environments have all been affected by this.

rscherer · ‎08-12-2008

Whew...well it appears that we might be okay. 22 hosts and 259 virtual machines -- I did not want there to be an issue.

bluedrake · ‎08-12-2008

We are unable to wait 36 HOURS for a patch: if one database server goes down during O/H we say goodbye to money, and lots of it. So instead, after hours, we are going to reinstall our 4 ESX servers with 3.5 Update 1 rather than run the risk. I'm very disappointed and wonder how VMware will ever make this up to us.


BIG bug in ESX 3.5 Update 2 - If you're using 3.5u2 read this now! - A general system error occurred: Internal Error