VMware Cloud Community
dhepple
Contributor

2008 R2 servers locking up - vSphere 5 (768111)

Hi guys,

We've recently been experiencing seemingly random lockups on some of our 2008 R2 servers running on vSphere 5. We have:

5 x ESXi 5 hosts in a cluster (patched to 768111), all HP DL380 G7 servers with 294 GB RAM each

All machines are on virtual hardware version 7 or 8, and VMware Tools is up to date on all of them

All VMs are stored on an HP EVA 6400 SAN with Continuous Access replication configured, with the target being an EVA P6300

2 x Brocade fabrics

Site Recovery Manager joining our Virtual Center with a DR Virtual Center.

CommVault Simpana 9 doing nightly backups

The affected servers stop functioning completely: you cannot RDP, ping, or even log on at the console, and the VMware Tools status shows "Not Running". A power reset sorts them out straight away (until the next random lockup). We can't spot any pattern in the VMs that lock up: it's not at any particular time or after any specific event. The only things the servers have in common are that they run Windows Server 2008 R2 (we've had 2 file servers, 2 SQL servers and a development server) and that they are backed up by CommVault Simpana 9.

We have been receiving these intermittent warnings on each ESXi host, which seem to be about a different datastore every time:

Device naa.6001438005dea0750000500006a10000 performance has deteriorated. I/O latency increased from average value of 2524 microseconds to 55327 microseconds. warning 09/09/2012 00:37:26
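To put the warning above into more readable units, here is a minimal Python sketch that pulls the device ID and the two latency figures out of the message text and converts them to milliseconds. The regular expression is written against the exact wording quoted above and is an assumption about the general message format; the function and field names are made up for illustration.

```python
import re

# Pattern built from the warning text quoted in this post; other ESXi builds
# may word the message differently (assumption).
WARNING_RE = re.compile(
    r"Device (?P<device>naa\.[0-9a-f]+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg_us>\d+) microseconds "
    r"to (?P<now_us>\d+) microseconds"
)

def parse_latency_warning(line):
    """Return (device, average_ms, current_ms, factor), or None if no match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    avg_us = int(match.group("avg_us"))
    now_us = int(match.group("now_us"))
    # Guard against a zero average so the ratio never divides by zero.
    factor = now_us / avg_us if avg_us else None
    return match.group("device"), avg_us / 1000.0, now_us / 1000.0, factor

sample = (
    "Device naa.6001438005dea0750000500006a10000 performance has deteriorated. "
    "I/O latency increased from average value of 2524 microseconds "
    "to 55327 microseconds."
)
result = parse_latency_warning(sample)
print(result)
```

For the sample line this gives roughly 2.5 ms average against 55.3 ms at the time of the warning, about a 22-fold increase.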

But none of these really correlate with the datastores where the troublesome VMs reside. In the Windows event log, the last event that seems to happen (usually a good few hours before it crashes) is:


Event 8224 The VSS service is shutting down due to idle timeout.

Any ideas? The obvious suspect would be CommVault Simpana, as it's a relatively new addition to our environment, but the lockups are so far away from the backup window that it doesn't really make sense. Any help would be greatly appreciated.

Thanks in advance


Duncan

8 Replies
Linjo
Leadership

In each VM's folder on the datastore there should be a log file (unless you have unchecked that box in the configuration). Have a look in it to see if there are any errors at those times.

// Linjo

dhepple
Contributor

Thanks Linjo, the log shows this just before it locks up (these times are an hour behind):


2012-08-18T17:21:25.247Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.
2012-08-18T17:21:25.247Z| vmx| Vix: [13204 guestCommands.c:2194]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): VMware Tools are not running in the guest
2012-08-18T17:21:33.609Z| vcpu-1| CPU reset: soft (mode 2)
2012-08-18T17:21:34.245Z| vcpu-0| SVGA: Unregistering IOSpace at 0x10d0
2012-08-18T17:21:34.245Z| vcpu-0| SVGA: Unregistering MemSpace at 0xd4000000(0xd4000000) and 0xd8000000(0xd8000000)
2012-08-18T17:21:34.247Z| vcpu-0| SVGA: Registering IOSpace at 0x10d0
2012-08-18T17:21:34.247Z| vcpu-0| SVGA: Registering MemSpace at 0xd4000000(0xd4000000) and 0xd8000000(0xd8000000)
2012-08-18T17:21:34.468Z| vcpu-0| pciBridge7:7: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.468Z| vcpu-0| pciBridge7:7: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.476Z| vcpu-0| PCIBridge4: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.476Z| vcpu-0| pciBridge7:7: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.484Z| vcpu-0| pciBridge4:1: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.484Z| vcpu-0| PCIBridge4: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.484Z| vcpu-0| pciBridge7:7: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.491Z| vcpu-0| pciBridge4:2: ISA/VGA decoding enabled (ctrl 0004)
2012-08-18T17:21:34.491Z| vcpu-0| pciBridge4:1: ISA/VGA decoding enabled (ctrl 0004)

This seems to happen after a CommVault backup, but we're not entirely convinced it's directly related, as the server in question didn't crash until 10am this morning (the backup ran at 22:30 and finished at 22:50). We found a similar problem online that suggested increasing the video RAM, which we have done on this server (now 128 MB). Have you seen this behaviour before?
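The log timestamps above end in "Z", which marks UTC, and that would explain why they look an hour behind the local clock. A minimal Python sketch for converting them follows. It assumes the site clock is a fixed UTC+1, as the "hour behind" remark suggests; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC+1, matching the "hour behind" remark above.
LOCAL_OFFSET = timezone(timedelta(hours=1))

def log_time_to_local(stamp):
    """Convert a log timestamp such as 2012-08-18T17:21:33.609Z to local time."""
    utc_time = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return utc_time.replace(tzinfo=timezone.utc).astimezone(LOCAL_OFFSET)

local = log_time_to_local("2012-08-18T17:21:33.609Z")
print(local.strftime("%Y-%m-%d %H:%M:%S"))
```

With that offset, the "CPU reset: soft" entry stamped 17:21:33Z corresponds to 18:21:33 local time.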

dhepple
Contributor

This server is probably a better example:

2012-09-10T16:40:55.933Z| vmx| GuestRpcSendTimedOut: message to toolbox-dnd timed out.
2012-09-10T16:40:55.938Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.
2012-09-10T16:41:10.934Z| vmx| GuestRpcSendTimedOut: message to toolbox-dnd timed out.
2012-09-10T16:41:10.934Z| vmx| GuestRpc: app toolbox-dnd's second ping timeout; assuming app is down
2012-09-10T16:41:10.934Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.
2012-09-10T16:41:10.934Z| vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down
2012-09-10T16:41:10.935Z| vmx| GuestRpc: Reinitializing Channel 2(toolbox-dnd)
2012-09-10T16:41:10.935Z| vmx| GuestMsg: Channel 2, Cannot unpost because the previous post is already completed
2012-09-10T16:41:10.935Z| vmx| GuestRpc: Channel 2 reinitialized.
2012-09-10T16:41:10.935Z| vmx| GuestRpc: Channel 2 reinitialized.
2012-09-10T16:41:10.935Z| vmx| GuestRpc: Reinitializing Channel 0(toolbox)
2012-09-10T16:41:10.935Z| vmx| GuestMsg: Channel 0, Cannot unpost because the previous post is already completed
2012-09-10T16:41:10.935Z| vmx| GuestRpc: Channel 0 reinitialized.
2012-09-10T16:41:10.935Z| vmx| GuestRpc: Channel 0 reinitialized.
2012-09-10T16:44:10.938Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.
2012-09-10T16:44:10.938Z| vmx| Vix: [1489919 guestCommands.c:2194]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): VMware Tools are not running in the guest

This was right before a crash yesterday afternoon (it crashed at 16:44)
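To compare several lockups, it may help to measure how long the host took to go from the first tools timeout to declaring VMware Tools not running. The sketch below does that for lines in the format shown above. It is only an illustration: the function names are invented, and the two markers it searches for (GuestRpcSendTimedOut and VIX_E_TOOLS_NOT_RUNNING) are simply the strings that appear in this excerpt.

```python
from datetime import datetime

def parse_line(line):
    """Split 'timestamp| thread| message' into (datetime, message), or None."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    try:
        when = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    return when, parts[2]

def tools_loss_summary(lines):
    """Return (first_timeout, not_running, seconds_between), or None."""
    first_timeout = None
    not_running = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, message = parsed
        if first_timeout is None and "GuestRpcSendTimedOut" in message:
            first_timeout = when
        if not_running is None and "VIX_E_TOOLS_NOT_RUNNING" in message:
            not_running = when
    if first_timeout is None or not_running is None:
        return None
    return first_timeout, not_running, (not_running - first_timeout).total_seconds()

# Three lines copied from the excerpt above.
excerpt = [
    "2012-09-10T16:40:55.933Z| vmx| GuestRpcSendTimedOut: message to toolbox-dnd timed out.",
    "2012-09-10T16:41:10.934Z| vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down",
    "2012-09-10T16:44:10.938Z| vmx| Vix: [1489919 guestCommands.c:2194]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): VMware Tools are not running in the guest",
]
summary = tools_loss_summary(excerpt)
print(summary)
```

For this excerpt the gap is a little over three minutes (16:40:55.933 to 16:44:10.938).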

athlon_crazy
Virtuoso

Actually, a similar problem is happening to one of my customers and we are still not able to resolve it. The latency is quite high (< 50ms), though no VM has hung, and the "performance has deteriorated" warning is the only error we get on each of the ESXi hosts. Both the HP EVA and FalconStor NSS hardware and configuration are working fine, and we feel a little bit lost right now. The weird thing is, we never had this issue before we upgraded all hosts from 3.5 to 5.0.

http://www.no-x.org
a1exp
Contributor

We had an issue with our 2008 R2 servers when we used the full VMware Tools install.

Some driver it installed clashed with 2008 R2 and we got random lockups.

Removing VMware Tools, rebooting, reinstalling using the compact option and rebooting again fixed it; we haven't had the issue since.

Confirmed as an issue via Microsoft PSS after we'd worked out the issue and fixed it ourselves.

dhepple
Contributor

Thanks. I can't find any info regarding a compact option for VMware Tools though. Is it called something else? Do you know what command it uses?

Josh26
Virtuoso

I have the same server hardware running the same ESXi version without issues.

So just to help narrow the issue down: it's either the SAN or the backup software that's different in your environment.

dhepple
Contributor

Yeah, we're pretty sure it's CommVault Simpana that is doing this. We've tried installing a "compact" version of VMware Tools (just mouse, network and SCSI drivers etc.) on some of the locking VMs and a complete install on others. The VM with the compact install has locked up again, so we're hoping that the complete install is the answer.

This is very frustrating, as it's really hard to recreate the problem; the locking just appears to be random. We've set up an hourly backup with CommVault for 4 of the locking VMs, and so far only one of them has locked up (twice), while the others have been OK.

Thanks for your tips guys. Will let you know the outcome when we sort this.
