Why are my servers powering off and why are dates in event viewer wrong?

jodytech07 · ‎02-05-2007

Please help -

I have a VM cluster on 4 HP 585s hosting about 25 VM servers. Most are MS Server 2003, plus some 2000 servers that were ghosted to VM. I have a handful of servers that just power off out of the blue. Nothing in the event logs gives me a direct clue as to why they are shutting down. My first thought was Microsoft, but it is happening to multiple servers that are not identical. Once the servers are powered on and I check the event viewer, it is very strange. The dates and times of the logs go from today’s date to 3 - 5 months in the future or past and then return to today’s date. SO BIZARRE!!

For example, today I got these messages after powering on:

Event Type: Error
Event Source: EventLog
Event Category: None
Event ID: 6008
Date: 11/1/2006
Time: 11:22:09 PM
User: N/A
Computer: NAWINCTX005
Description:
The previous system shutdown at 11:40:30 AM on 2/5/2007 was unexpected.

Event Type: Information
Event Source: LGTO_Sync
Event Category: None
Event ID: 1
Date: 11/1/2006
Time: 11:22:15 PM
User: N/A
Computer: NAWINCTX005
Description:
The description for Event ID ( 1 ) in Source ( LGTO_Sync ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: , The Driver was loaded successfully .

Event Type: Warning
Event Source: W32Time
Event Category: None
Event ID: 52
Date: 2/5/2007
Time: 11:43:01 AM
User: N/A
Computer: NAWINCTX005
Description:
The time service has set the time with offset 8252418 seconds.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
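
The W32Time offset can be checked against the odd log dates with a bit of arithmetic. This is a minimal sketch in Python, assuming the Event ID 52 timestamp (2/5/2007 11:43:01 AM) is the clock reading right after the correction; the variable names are only illustrative.

```python
from datetime import datetime, timedelta

# Offset reported by W32Time Event ID 52, in seconds.
offset = timedelta(seconds=8252418)

# Assumption: the event's own timestamp is the clock value just after the correction.
corrected = datetime(2007, 2, 5, 11, 43, 1)

# Clock value the guest would have had just before the correction.
before = corrected - offset

print(offset)   # 95 days, 12:20:18
print(before)   # 2006-11-01 23:22:43
```

Under that assumption the clock read about 11:22:43 PM on 11/1/2006 just before the correction, which lines up with the 11/1/2006 11:22 PM stamps on the first two events. So the guest appears to have come up with its clock roughly 95.5 days behind, and the time service then pulled it back to the real date.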

Does this make any sense to anyone? Please HELP!!

Rumple · ‎02-06-2007

oops...haven't learned to post apparently.

Rumple · ‎02-06-2007

Try disabling the Windows Time service and have VMware Tools keep time for the virtual machines.

Also make sure you have NTP set up and working on the ESX hosts.

Brian_Chang · ‎03-08-2007

I'm having the same issue. I'm only running 2 ESX 2.5.4 hosts with only 3 VMs apiece. There's a 2003 AD server on each, and at about the same time (8am) both sessions get restarted for no apparent reason. The Windows logs show nothing out of the ordinary except the unexpected shutdown. What's really weird, though, is that when I console into the restarted sessions I'm at the PXE boot screen on POST with no operating system found. I can manually restart the session and everything comes up fine. These 2 hosts are back-ended into an MSA1500cs through a Cisco MDS9020. Even more bizarre is that the other VMs on each host are not affected.

I'm hoping someone has a clue as to what might be happening.

Rob_Bohmann1 · ‎03-08-2007

We've had something similar happen to 4-5 guests on a couple of our 2.5.3 hosts after applying one of the recent patches. Only 1 of the guests did this more than once (3 times, I think), so that one got moved to 3.0 immediately to resolve it.

I also think the dates in the events view on the host in VC (1.3) were incorrect for these events. I only worked peripherally on this issue; my partner took the lead, so my memory of the details is sketchy.

We have had an SR open for a few weeks now. Brian, I am curious about what you are seeing; it sounds like your guests are rebooting rather than powering off.

Is this accurate or am I misreading your post?

Brian_Chang · ‎03-08-2007

Well, this was happening on the hosts when they were still 2.5.2. They have recently been upgraded to 2.5.4 update 5 and it's still going on. Everything was fine for about a year, up until a couple of months ago. Then this started happening, so the hosts were upgraded to deal with it, but no go.

As for rebooting... the event logs on the guest systems show an unexpected shutdown, so it's not the OS rebooting itself. It is possible that they are BSODing, but it's weird that there's a "no operating system" message on POST.

I found some references to situations where a SAN LUN may not respond quickly enough for the guest OS, causing a BSOD. I've applied the fix to one of the servers to see if it corrects the problem. If it does, I should see one server still exhibit the issue while the other stays up. The SAN theory would be consistent with what I'm seeing, since they both have the issue at the same time. I'll keep you guys posted.

For reference, here's the link to the KB article:

http://kb.vmware.com/vmtnkb/search.do?cmd=displayKC&docType=kc&externalId=1014&sliceId=SAL_Public

boydd · ‎03-08-2007

I've had something similar happening for about a week now - a couple of VMs powering off for no reason. There are events in both VC and ESX stating that the VMs have made an illegal request or call and will be powered off. I've narrowed this down to the LSI SCSI driver in the VMs - I'm working on a fix now. It seems that a few VMs have a different, unsigned driver version than the others (that may be the culprit - and the only way you can tell is in the driver details). I've replaced the driver on one of the VMs - we'll see what happens when I get in tomorrow morning.

DB

Brian_Chang · ‎03-09-2007

It seems like all the VMs on each of our ESX servers are deciding to behave. As I stated above, I made a change to one of the two problem servers, but both are still standing as of this morning. This is after 3 mornings of consistent problems. Contention for the SAN is still a suspect, so I'll keep everyone posted.

Brian_Chang · ‎03-09-2007

Looks like the problem is still there. The servers just went down again. Anyone have any other ideas?

boydd · ‎03-09-2007

Any errors in the ESX or VC logs? As I stated in my previous post, in my case it looks like the LSI driver had become corrupt for some reason (I upgraded the driver on one VM yesterday and so far things look promising). The errors that I had been seeing were in ESX, VC and the VM (which allowed me to narrow it down to the symmpi.sys driver and possibly corrupt supporting DLLs). Look around the logs. I also opened an SR for this a few days ago.

DB

Brian_Chang · ‎03-10-2007

Figured out what the problem is, and it has little to do with ESX itself. It seems that our Fibre Channel switch was having some issues. Basically, the datastore was being disconnected and reconnected. I didn't pay attention to the Fibre Channel subsystem messages on the SAN indicating a problem with the link. Miss one little thing and you spend days scratching your head.

Thanks for all the help.
