VMware Cloud Community
disasteraverte1
Contributor

ESXi 4.1 on Dell R510 Crashes Every 7 Days

I have a very strange problem with a brand new Dell R510 I've deployed with ESXi 4.1.  It has a 2x1TB SATA RAID 1 via an integrated Dell SAS6 card.

I had it running just fine for 7 days, then it "crashed."  When it crashed I was able to ping the ESXi management IP, but not any guest VMs.  I was also able to gain ssh access to the server and look at logs, etc., but no virtual machine functions were working.  I tried restarting the management agents at the console, no luck.  No luck trying to do a proper reboot, either.  I had to power off/on the server to get things working.

I dug through the logs, but found nothing.  7 days later it crashed again.  Installed Dell's VIBs for ESXi, installed all the latest patches, firmware, etc., and installed a vMA VM on another ESXi host.  No hardware problems found via OpenManage or memtest.

Exactly 7 days later, it crashed again.  It happens nearly to the minute 7 days after the last reboot.
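
To check how closely each failure tracks the 7-day mark, the interval arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the boot time used in the example is hypothetical, not taken from the logs.

```python
from datetime import datetime, timedelta

PERIOD = timedelta(days=7)  # the interval reported in this thread

def next_expected_failure(last_boot):
    """Time at which the next failure would occur if the 7-day pattern holds."""
    return last_boot + PERIOD

def deviation_from_period(last_boot, failure):
    """Observed uptime minus 7 days; a value near zero means the pattern held."""
    return (failure - last_boot) - PERIOD

# The boot time below is hypothetical and used for illustration only.
boot = datetime(2011, 1, 11, 19, 6, 0)
crash = datetime(2011, 1, 18, 19, 8, 41)
print("next failure expected around:", next_expected_failure(boot))
print("deviation from 7 days:", deviation_from_period(boot, crash))
```

A deviation of only a few minutes across several cycles would point toward something tied to uptime rather than to wall-clock time.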

The logs still show nothing that I can see, other than a gap after it "crashes," and that is only in the Vpxa log.  I also looked at the event logs on the guest VMs, and they seem to end right when ESXi crashes, so it seems to be more than just a network issue.  I suspected the license (in case I had forgotten to install it), but that checks out OK.  Maybe a time sync issue?

Here is the most relevant portion of the messages.log that shows the crash. It happened at about 19:08 EST on 1/18/2011 (the rest is attached; I bolded where the server goes silent and then shows the boot-up process):
UnderstandsChunking: true CanKeepAlive: true (PresetContentLength -1)
JAN 18 19:08:21 Vpxa: [2011-01-19 00:08:29.489 1A3ABB90 verbose 'SoapAdapter.HTTPService'] User agent is 'VMware-client/4.1.0'
JAN 18 19:08:21 Vpxa: [2011-01-19 00:08:29.489 1A3ABB90 verbose 'SoapAdapter.HTTPService'] HTTP Response: Client: NeedsContentLength: false UnderstandsChunking: true CanKeepAlive: true (PresetContentLength -1)
JAN 18 19:08:21 Vpxa: [2011-01-19 00:08:29.489 1A3ABB90 verbose 'SoapAdapter.HTTPService'] HTTP Response: Complete (processed 530 bytes)
JAN 18 19:08:21 Vpxa: [2011-01-19 00:08:29.489 1A3ABB90 verbose 'SoapAdapter.HTTPService'] HTTP Response: Complete (processed 406 bytes)
JAN 18 19:08:31 Vpxa: [2011-01-19 00:08:34.544 1A329B90 verbose 'VpxaHalCnxHostagent'] Received callback in WaitForUpdatesDone
JAN 18 19:08:31 Vpxa: [2011-01-19 00:08:34.544 1A329B90 verbose 'VpxaHalCnxHostagent'] [VpxaHalCnxHostagent::ProcessUpdate] Applying updates from 41781 to 41782 (at 41781)
JAN 18 19:08:31 Vpxa: [2011-01-19 00:08:34.545 1A329B90 verbose 'App'] [VpxaHalVmHostagent] 144: GuestInfo changed 'guest.disk'
JAN 18 19:08:31 Vpxa: [2011-01-19 00:08:34.545 1A329B90 verbose 'App'] [VpxaHalServices] VmGuestDiskChange Event for vm(3) 144
JAN 18 19:08:31 Vpxa: [2011-01-19 00:08:34.545 1A329B90 verbose 'App'] [VpxaInvtVmChangeListener]Guest DiskInfo Changed
JAN 18 19:08:31 Vpxa: [2011-01-19 00:08:39.440 1A4F0B90 verbose 'App'] [VpxaMoVm::CheckMoVm] did not find a VM with ID 8 in the vmList
JAN 18 19:08:31 Vpxa: [2011-01-19 00:08:39.440 1A4F0B90 verbose 'App'] [VpxaAlarm] VM with vmid = 8 not found
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.415 1A266B90 warning 'App'] [VpxaHalStats] Unexpected return result. Expect 1 sample, receive 2
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.418 1A266B90 verbose 'App'] Set internal stats for VM: 2 (vpxa VM id), 21 (vpxd VM id). Is FT primary? 0
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.420 1A266B90 verbose 'App'] Set internal stats for VM: 3 (vpxa VM id), 22 (vpxd VM id). Is FT primary? 0
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.422 1A266B90 verbose 'App'] Set internal stats for VM: 4 (vpxa VM id), 24 (vpxd VM id). Is FT primary? 0
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.424 1A266B90 verbose 'App'] Set internal stats for VM: 5 (vpxa VM id), 25 (vpxd VM id). Is FT primary? 0
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.426 1A266B90 verbose 'App'] Set internal stats for VM: 6 (vpxa VM id), 26 (vpxd VM id). Is FT primary? 0
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.429 1A266B90 verbose 'App'] Set internal stats for VM: 7 (vpxa VM id), 27 (vpxd VM id). Is FT primary? 0
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.442 1A4F0B90 verbose 'App'] [VpxaMoVm::CheckMoVm] did not find a VM with ID 8 in the vmList
JAN 18 19:08:41 Vpxa: [2011-01-19 00:08:49.442 1A4F0B90 verbose 'App'] [VpxaAlarm] VM with vmid = 8 not found
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.acpiDbgLevel'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.allowInterleavedNUMAnodes'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.assumeCommonBusClock'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.assumePerNodeBusClock'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.buddyPhysicalMemoryDebugStruct'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.busSpeedMayVary'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.busSpeedMayVaryPerNode'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.checkCPUIDLimit'
JAN 18 20:24:00 Hostd: [2011-01-19 01:23:42.588 FF979E80 verbose 'KernelOptionsProvider'] Registered advanced option 'VMkernel.Boot.checkDMA
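
The silent period in an excerpt like this can be located mechanically rather than by eye. The following is a minimal Python sketch, assuming every entry starts with the `MON DD HH:MM:SS` prefix seen above and that wrapped continuation lines can be skipped. The function names and the 5-minute threshold are choices made for this sketch, and the year has to be supplied by the caller because the prefix does not carry one.

```python
import re
from datetime import datetime, timedelta

# Matches the syslog-style prefix seen in the excerpt, e.g. "JAN 18 19:08:41 Vpxa: ..."
PREFIX = re.compile(r"^([A-Za-z]{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) ")

def parse_stamp(line, year):
    """Return the datetime from a line's prefix, or None for wrapped/continuation lines."""
    m = PREFIX.match(line)
    if not m:
        return None
    month, day, clock = m.groups()
    # The prefix carries no year, so the caller supplies one (assumption).
    return datetime.strptime(f"{year} {month.title()} {day} {clock}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, year, threshold=timedelta(minutes=5)):
    """Yield (last_seen, next_seen) pairs where the log is silent longer than threshold."""
    previous = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if previous is not None and stamp - previous > threshold:
            yield previous, stamp
        previous = stamp

# Two abbreviated lines from the excerpt, on either side of the silence.
sample = [
    "JAN 18 19:08:41 Vpxa: [VpxaAlarm] VM with vmid = 8 not found",
    "JAN 18 20:24:00 Hostd: Registered advanced option 'VMkernel.Boot.acpiDbgLevel'",
]
for before, after in find_gaps(sample, 2011):
    print(before, "->", after, "silent for", after - before)
```

For the two sample lines this reports a silence of 1:15:19 between the last Vpxa entry and the first Hostd entry after power-on. Running the same function over the full messages.log (for example with `open("messages.log")` as the `lines` argument) would list every gap above the threshold.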

Certainly would be time to call VMware, but this client only has VMware Essentials and no support :(.

Any pointers would be truly appreciated!

10 Replies
DSTAVERT
Immortal

Do you have the logs writing to a datastore or remote syslog server? ESXi normally writes to a RAM disk and logs are usually lost on restart.

-- David -- VMware Communities Moderator
idle-jam
Immortal

When a server crashes, it is more likely a hardware-related issue. Licensing problems would only cause certain features to be disabled. A normal crash would be a PSOD, in which the whole screen turns purple. In your case, something I have faced before was a faulty NIC. Perhaps you could look at it from that angle and have Dell replace it.

disasteraverte1
Contributor

They are writing locally, but the VMware vMA is running on another ESXi server that's more stable... it collects logs every 30 seconds (the shortest option).  It seems to be able to grab logs, even when the host has "crashed".

DSTAVERT
Immortal

I would have the logs write to a datastore location. 30 seconds can make all the difference.

Use your Dell diagnostic tools.

Run memtest.

I don't know whether your server has DRAC, but you may be able to collect troubleshooting data from that.

Make sure there isn't some huge piece of machinery starting up next door that causes a power fluctuation (just grasping).

-- David -- VMware Communities Moderator
disasteraverte1
Contributor

No DRAC, but I do have visibility into this server via OpenManage (VIBs), and it shows no errors.

Is there much use in running memtest with ECC memory?  I ran a quick test without error.

This box is on a UPS, so big machinery shouldn't matter...

By Dell diag tools, do you mean a boot CD they provide / you can download?  I don't think I've ever used them, just OpenManage; are they any better?

Good tip to write logs direct to data store... is there an easy way to configure that other than setting up a syslog host?  Not that a syslog host is particularly difficult...

Nick

DSTAVERT
Immortal

I would run memtest no matter what, especially prior to putting any server into production. I don't know what Dell has for diagnostics, but all server manufacturers have something. These are usually boot CDs that run a series of tests directly against the hardware (no installed OS involved).

I wasn't suggesting that you do have machinery running, just that you should consider everything, not only the server itself, especially since you said the problem happens at almost exactly the same time.

You can change the log location from the vSphere Client: Configuration tab -> Software -> Advanced Settings -> Syslog. The change should take effect immediately.
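
If a remote syslog host turns out to be preferable, the receiving side does not need much. Below is a minimal Python sketch of a UDP syslog receiver that appends every datagram to a file. It is an illustration only and not a hardened collector: the output file name, the function names, and the line format are choices made for this sketch. On the ESXi side the host would be pointed at the receiver through the same Syslog advanced settings (to the best of my knowledge `Syslog.Remote.Hostname` and `Syslog.Remote.Port` on ESXi 4.x, but verify the names on your build).

```python
import socketserver
from datetime import datetime

LOG_PATH = "esxi-remote.log"  # hypothetical output file

def format_record(source_ip, payload, received=None):
    """Build one output line: receive time, sender address, decoded message."""
    received = received or datetime.now()
    text = payload.decode("utf-8", errors="replace").rstrip()
    return f"{received.isoformat(timespec='seconds')} {source_ip} {text}\n"

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(format_record(self.client_address[0], data))

def serve(host="0.0.0.0", port=514):
    """Listen for UDP syslog datagrams until interrupted (port 514 usually needs root)."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on another machine starts the listener. Because each message is written as it arrives, entries sent in the last seconds before a hang are kept, which a 30-second collection interval can miss.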

-- David -- VMware Communities Moderator
disasteraverte1
Contributor

Looks like hardware.  Got a page this AM that it went down again, breaking the 7 day cycle.  Based on a tip someone else suggested, I cycled through all the ALT+F* consoles... when I got to ALT+F12, I saw some SAS errors (see attached).  Pursuing this with Dell.

Thanks for all the suggestions!

bolsen
Enthusiast

Does the RAID card have a battery?  Perhaps it's the battery test, which could be causing the write cache to turn off.

disasteraverte1
Contributor

Yes, it has a battery-backed cache, but no log entries suggest it kicked off a learn cycle when the crash happened.  Looking in OpenManage, there is no place to schedule the learn cycle either...

bolsen
Enthusiast

Just a thought - you could configure the card to use the write cache regardless of battery state.  Might be worth a shot if you're out of ideas.
