Re: BIG bug in ESX 3.5 Update 2 - If you're using ... - Page 20

mattjk · ‎08-11-2008

The express patches have been posted. This thread is long.

Please post technical experiences here and non-technical feedback here. --JohnTroyer

Hi all,

We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.

The VMware tech support person we spoke to wouldn't 100% confirm whether this was or would be affecting all ESX 3.5u2 installs, but he strongly hinted that it was widespread. For others' sake I hope I'm wrong and it's limited.

The bug:

Starting this morning, we could neither power on nor VMotion any of our virtual machines. The VI Client threw the error "A general system error occurred: Internal Error".

Further digging led us to messages like this one in /var/log/vmware/hostd.log, and in the log file for any virtual machine we tried to power on or VMotion:

Aug 12 10:40:10.792: vmx| This product has expired.

Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.

Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".

A call to tech support confirmed this as a known problem with a temporary workaround.
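For anyone wanting to check a copy of their own logs for the same symptom, here is a minimal sketch. It assumes the expiry message always appears in the form quoted above; the function name `find_expiry_lines` is made up for illustration, and other log entries may be formatted differently.

```python
import re

# Pattern assumed from the log lines quoted in this post; other
# hostd.log / VM log entries may not follow the same format.
EXPIRED = re.compile(r"vmx\|\s*This product has expired\.")

def find_expiry_lines(lines):
    """Return the log lines that carry the 'product has expired' message."""
    return [line.rstrip("\n") for line in lines if EXPIRED.search(line)]

sample = [
    "Aug 12 10:40:10.792: vmx| This product has expired.",
    "Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.",
]
hits = find_expiry_lines(sample)
print(hits)
```

To scan a real file, pass an open file object instead of the sample list, for example a copy of /var/log/vmware/hostd.log.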

The work-around:

Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.
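If you have many hosts, it may help to generate the command list once and review it before running anything. The sketch below only prints commands and does not execute them. The `date -s` line is taken from this post; `service ntpd stop` is an assumption about how NTP runs on the Service Console, so check it against your own hosts first. The helper name `workaround_commands` is hypothetical.

```python
def workaround_commands(rollback_date="08/10/2008", stop_ntp=True):
    """Return the Service Console commands for the date workaround, in order."""
    commands = []
    if stop_ntp:
        # Assumption: NTP runs as the 'ntpd' service on the Service Console.
        commands.append("service ntpd stop")
    # From the post: set the host clock back to 10 August 2008.
    commands.append('date -s "%s"' % rollback_date)
    return commands

# Dry run: print the commands for review, nothing is executed.
for cmd in workaround_commands():
    print(cmd)
```

NTP is stopped first so that it does not immediately pull the clock forward again after the date is set back.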

As soon as the date was reset to the 10th - problem solved.

Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from suspended state) and VMotion.

So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.

There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.
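To put rough numbers on the time zone point, here is a small sketch. It assumes the expiry check follows each host's local date, which this thread does not confirm, and it uses fixed UTC offsets for three example locations rather than anyone's actual sites.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets in effect in August 2008 (example locations only).
OFFSETS = {
    "Sydney (AEST)": 10,
    "London (BST)": 1,
    "New York (EDT)": -4,
}

def trigger_utc(offset_hours):
    """UTC instant at which local time reaches 00:00 on 12 August 2008."""
    local_tz = timezone(timedelta(hours=offset_hours))
    local_midnight = datetime(2008, 8, 12, tzinfo=local_tz)
    return local_midnight.astimezone(timezone.utc)

# Print locations in the order their local date reaches the 12th.
for site, hours in sorted(OFFSETS.items(), key=lambda kv: -kv[1]):
    print(site, trigger_utc(hours).strftime("%Y-%m-%d %H:%M UTC"))
```

Under those assumptions, a host on Australian eastern time crosses into the 12th nine hours before one in London and fourteen hours before one in New York.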

Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!

Cheers,

Matt Kilham

Stratton Car Finance

Message was edited by: JohnTroyer to add new thread links.

bluedrake · ‎08-12-2008

Anyone who says we patched too early is wrong. People exploit vulnerabilities within days, so it's best to update as soon as possible. We did our testing before updating and were happy with it; however, here I find myself working late to reinstall our ESX servers to protect us from any risk of them falling over during office hours.

We put our faith in a company like VMware to release stable updates that can be used in multi-million-dollar businesses; today we found out we were wrong.

gladney · ‎08-12-2008

Hi all, just thought I'd mention that when I used Update Manager to patch an ESX 3.5u2 server, it gave me an error stating I was not licensed and uninstalled Update Manager from my client PC. Luckily I only have 4 of my 30 ESX servers on 3.5u2.

Eagerly awaiting a patch tomorrow....

-Dave

Tibmeister · ‎08-12-2008

Wow, a lot of emotion out here today! I agree, this sucks, but unfortunately it happens to the best. I saw some posts about people being foolish for applying the update so quickly. Well, ponder this: you applied Update 1 two months after it was out, and then today you find it has this drop-dead date of August 12th. Were you foolish for applying that? No, not at all.

My company has a policy that patches and updates will be applied within 30 days, regardless of how I feel about it. So I got hit by this bug, but instead of panicking or attacking the QA folks at VMware, I've been pondering how to solve this problem.

So, if anyone else has posted this solution, then I am sorry for repeating.

I totally agree that changing the date is only a small step in the workaround. So, here's what I am planning to do to fix this, since it seems that only the target server is affected, not the host server.

1) Change the date on 1 server in your farm. Luckily I have a spare server I can put into the farm temporarily.

2) Shut off DRS and HA

3) Manually move the VM's from 1 host to this temp host.

4) Place the host you just moved all your VM's off of in Maintenance Mode and remove from Virtual Center.

5) Reinstall ESXi U1 on the host you have removed from Virtual Center.

6) VMotion the VM's from another U2 host to the new U1 host and repeat steps 4 & 5 until all hosts in the cluster are at U1.
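The steps above can be sketched as a plan generator, which may help when working out the order for a larger cluster. This is only an illustration under assumptions: the host names are made up, the function `rolling_downgrade_plan` is hypothetical, and it builds a checklist rather than calling any VMware API.

```python
def rolling_downgrade_plan(hosts, temp_host):
    """Return ordered (action, detail) steps for taking a cluster from U2 back to U1.

    hosts: names of the U2 hosts currently running VMs (hypothetical names).
    temp_host: spare host whose date is set back so it can accept VMs.
    """
    steps = [
        ("set-date", temp_host),       # step 1
        ("disable", "DRS and HA"),     # step 2
    ]
    target = temp_host                 # where the next batch of VMs will land
    for host in hosts:
        steps.append(("migrate", "%s -> %s" % (host, target)))   # steps 3 / 6
        steps.append(("maintenance-and-remove", host))           # step 4
        steps.append(("reinstall-u1", host))                     # step 5
        target = host                  # the rebuilt U1 host takes the next batch
    return steps

for action, detail in rolling_downgrade_plan(["esx1", "esx2", "esx3"], "spare"):
    print(action, detail)
```

Each rebuilt host becomes the landing spot for the next host's VMs, so only the first migration depends on the spare host with the rolled-back date.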

Problem fixed: you are now back on U1 code that AFAIK is stable. This will allow VMware to properly fix and test the U2 code before releasing it, and ensures that you are still able to run. For validated systems (those that can't have the clock changed), I would have a discussion with your legal department to determine whether the time requirements extend all the way down to the host or stop at the guest OS. Most legal departments have not taken into consideration the separation of guest and host OS in the virtual world. It's a grey area to be addressed.

pschillaci · ‎08-12-2008

I reported an issue to VMware support several months ago about poor VM performance on Intel quad-core CPUs. After I worked with their support for almost 2 months proving that there actually was a problem, they went into their labs and decided that the problem could not be fixed in the current version and will have to wait until version 4.0. That means I will wait close to a year for a fix.

It looks like VMware is having a big problem with their QA team. They better do something about it quick.

ditro2001 · ‎08-12-2008

Hi everyone,

I hope the bug is being fixed right now. Does anyone know whether I can update my ESX U2 servers with Update Manager to fix the problem, or do I need to reinstall the ESX servers?

regards

BryanMcC · ‎08-12-2008

So let me guess.. You have all Windows 2008 Servers.

Help me help you by scoring points.

PhilipArnason · ‎08-12-2008

Since this is the thread we're all hanging out in, let's get a poll going on how big of a headache this actually was for you:

1) How big is your virtual infrastructure and what advanced VMware features do you use?

2) How many minutes of downtime did your production/test environment suffer?

3) Are all your VMs back up and running now?

4) What solution have you employed? And any other relevant information.

I'll go first.

1) 5 ESX hosts with 70 VMs. I use DRS and VMotion.

2 and 3) No downtime as I haven't shut down anything.

4) I expect to be able to wait until the end of the week when patches come through Update Manager. If a VM did by chance get shut down I will use the date reversal fix.

Philip Arnason

curriertech · ‎08-12-2008

1) 6 ESX Hosts, ~40 Guests, DRS, HA, etc...

2) None, I found this thread before it became an issue.

3) Yes.

4) Will implement patch upon its release.

-Josh.

Phil_White · ‎08-12-2008

Since this is the thread we're all hanging out in, let's get a poll going on how big of a headache this actually was for you:

1) How big is your virtual infrastructure and what advanced VMware features do you use?

Not too big, but management has decided to put very critical servers in a virtual environment. I don't have to deal with it enough to really have issues.

2) How many minutes of downtime did your production/test environment suffer?

None so far, hope it stays that way.

3) Are all your VMs back up and running now?

Never went down, except for our test VMs, which I already had off from the 3.5u2 update the other night.

4) What solution have you employed? And any other relevant information.

Right now I'm just waiting for VMware to fix it. If something happens I'll just bring up one of my nightly backups and run it from my computer. We spend a lot of effort making sure we can handle disasters properly.

mrbill007 · ‎08-12-2008

1) 6 ESX hosts, ~50 guests, hosts clustered for DRS and HA.

2) Zero.

3) All VMs are running, none of them shut down.

4) I reset the date, used VMotion to migrate VMs from one host, installed 3.5U1, rinse and repeat.

All machines are back at U1, it was a relatively painless process.

CLowe · ‎08-12-2008

Since this is the thread we're all hanging out in, let's get a poll going on how big of a headache this actually was for you:

1) How big is your virtual infrastructure and what advanced VMware features do you use?

4 hosts - 75 vm's - DRS - VMotion

2) How many minutes of downtime did your production/test environment suffer?

0:00

3) Are all your VMs back up and running now?

Yes

4) What solution have you employed? And any other relevant information.

Waiting for the patch. Will use the date regression if necessary.

Tibmeister · ‎08-12-2008

8 Host servers (7 active at this time)

70 VM's (57 Production)

0 downtime - As soon as I found out, I put the word out to my group to not touch a VM until we learned more.

Still in the planning phase, waiting for more news from VMware. I have fallback plans in place using established DR procedures, and do not want to panic.

mcowger · ‎08-12-2008

1) 24 hosts, 205 VMs, DRS, VMotion, HA

2) 0:00

3) yes, never went down

4) None - we don't reboot VMs very often.

--Matt

--Matt VCDX #52 blog.cowger.us

rhunter · ‎08-12-2008

1) 5 esx3.5u2 hosts ~40-50 vms, iSCSI and FCP SAN environments

2) all day

3) I unfortunately rebooted the esx host that had my virtual center server VM thinking it would fix this issue. I then read the bug report.

4) date change workaround and disabling of NTP settings in my automation to prevent the date from being set > August 11

Jasemccarty · ‎08-12-2008

1. 6 Hosts (8 way single core/32GB Ram), 1 VirtualCenter Server, 330 VM's (all production)

2. Zero Downtime

3. Still Going

4. Haven't upgraded to 3.5 U anything yet.

Jase McCarty

http://www.jasemccarty.com

Co-Author of VMware ESX Essentials in the Virtual Data Center

(ISBN:1420070274) from Auerbach

Jase McCarty - @jasemccarty

danzbassman · ‎08-12-2008

Wow! We have the same problem. We definitely dodged a bullet though. We have a production VM we usually reboot each morning due to a problematic DLL. We didn't today. When I read this I tried to power on some downed VMs. They won't power up. This is a serious problem for us.

We have 4 ESX hosts, 75 VMs.

We use DRS, though I've set it to the most conservative setting for now.

No downtime so far, but I can't bring up VMs that were powered off.

DSTAVERT · ‎08-12-2008

This isn't the first time beta time bombs have been left behind. They seldom live long past the release date, so these things can be somewhat anticipated and planned for. Applying patches based on date, or without a thorough understanding of what they apply to, is very poor policy. Relying on some company for the safety of your data WILL get you in trouble, over and over. A good compliance auditor will tell you that it isn't just about applying patches. They MUST solve a problem that YOU have, and the security risk must be something that you are actually exposed to; otherwise, don't apply them.

The old saw "If it ain't broke don't fix it" somewhat applies, although now you must truly understand what the broke is.

-- David -- VMware Communities Moderator

rlabhart · ‎08-12-2008

1) 33 Hosts/360Guests/DRS/HA/iScsi/FCP

2) 0 Downtime

3) Yes

4) Don't update until fix available.

AntonVZhbankov · ‎08-12-2008

There's no way to find this problem. QA can't do exhaustive testing, simply because there are many ways to implement functionality and there is an almost infinite number of possible tests. I think the engineer who hardcoded the 12th of August and forgot to remove it before release has already got his "prize" from management.

EMCCAe, HPE ASE, MCITP: SA+VA, VCP 3/4/5, VMware vExpert XO (14 stars)
VMUG Russia Leader
http://t.me/beerpanda

gadeem0517 · ‎08-12-2008

I don't know about the rest of you, but I am expecting some very expensive SWAG at VMworld next month!
