VMware Cloud Community
hbuquis
Contributor

VM guest unexpected restart

\[Env] ESX Server 4.1 U1, Guest OS: RHEL 5 64bit

We have a VM guest that keeps restarting unexpectedly, but we can't find any issue inside the guest itself. To be sure, I've attached the vmware.log entries for that guest. Does anyone know how to read these logs, or can tell from them whether there is a configuration setting we need to change?

Thank you

Jun 20 22:33:05.268: vcpu-0| VMMouse: CMD Disable

Jun 20 22:33:05.268: vcpu-0| VMMouse: Disabling VMMouse mode

Jun 20 22:33:06.532: vcpu-0| VMMouse: CMD Read ID

Jun 20 22:33:12.132: vcpu-1| Balloon: Start: vmmemctl reset balloon

Jun 20 22:33:12.132: vcpu-1| Balloon: Reset (n 9 pages 0)

Jun 20 22:33:12.132: vcpu-1| Balloon: Reset: nUnlocked 0 size 0

Jun 20 22:33:12.494: vcpu-0| GuestRpc: Channel 2 reinitialized.

Jun 20 22:33:12.569: vcpu-0| GuestRpc: Channel 1, App stopped

Jun 20 22:33:12.569: vcpu-0| GuestRpc: Reinitializing Channel 1(toolbox)

Jun 20 22:33:12.569: vcpu-0| GuestMsg: Channel 1, Cannot unpost because the previous post is already completed

Jun 20 22:33:12.570: vcpu-0| GuestRpc: Channel 1 reinitialized.

Jun 20 22:33:12.570: vcpu-0| GuestRpc: application toolbox already registered, id: -1

Jun 20 22:33:12.570: vcpu-0| GuestRpc: Channel 3, guest application toolbox.

Jun 20 22:33:12.623: vcpu-0| TOOLS ToolsCapabilityGuestTempDirectory received 1 /tmp/vmware-root

Jun 20 22:33:12.624: vcpu-0| TOOLS autoupgrade protocol version 2

Jun 20 22:33:12.626: vcpu-0| TOOLS ToolsCapabilityGuestConfDirectory received /etc/vmware-tools

Jun 20 22:33:12.626: vcpu-0| ToolsSetVersionWork did nothing; new tools version (8295) matches old Tools version

Jun 20 22:33:12.627: vcpu-0| TOOLS unified loop capability requested by 'toolbox'; now sending options via TCLO

Jun 20 22:33:12.628: vcpu-0| Guest: toolbox: Version: build-341836

Jun 20 22:33:34.257: vcpu-2| CDROM: Emulate GET CONFIGURATION RT 2 starting feature 0

Jun 20 22:33:34.261: vcpu-2| CDROM: Emulate GET CONFIGURATION RT 2 starting feature 0

Jun 20 22:33:34.295: vcpu-0| FLOPPYLIB-MAIN : CMD Checking status for non-existent Drive 0

Jun 20 22:33:58.930: vcpu-2| VMMouse: CMD Disable

Jun 20 22:33:58.930: vcpu-2| VMMouse: Disabling VMMouse mode

Jun 20 22:33:59.100: vcpu-3| SVGA: Unregistering IOSpace at 0x10d0

Jun 20 22:33:59.100: vcpu-3| SVGA: Unregistering MemSpace at 0xd4000000(0xd4000000) and 0xd8000000(0xd8000000)

Jun 20 22:33:59.104: vcpu-3| SVGA: Registering IOSpace at 0x10d0

Jun 20 22:33:59.104: vcpu-3| SVGA: Registering MemSpace at 0xd4000000(0xd4000000) and 0xd8000000(0xd8000000)

Jun 20 22:33:59.233: vcpu-0| SVGA: Unregistering IOSpace at 0x10d0

Jun 20 22:33:59.233: vcpu-0| SVGA: Unregistering MemSpace at 0xd4000000(0xd4000000) and 0xd8000000(0xd8000000)

Jun 20 22:33:59.237: vcpu-0| SVGA: Registering IOSpace at 0x10d0

Jun 20 22:33:59.237: vcpu-0| SVGA: Registering MemSpace at 0xd4000000(0xd4000000) and 0xd8000000(0xd8000000)

Jun 20 22:33:59.555: mks| Guest display topology changed: numDisplays 1

Jun 20 22:33:59.560: vcpu-2| VMMouse: CMD Read ID

Jun 20 22:33:59.560: vcpu-2| VMMouse: CMD Disable

Jun 20 22:33:59.560: vcpu-2| VMMouse: Disabling VMMouse mode

Jun 20 22:34:00.597: vcpu-0| VMMouse: CMD Read ID

5 Replies
SG1234
Enthusiast

Hi -- what's the guest OS type? Anything in the guest OS logs? Anything relevant at the same time in the vmkernel logs or hostd.log?

thanks,

~Sai Garimella

hbuquis
Contributor

Hi, here are the OS logs... I no longer have the corresponding vmkernel or hostd.log entries.

Server1 kernel: (o2net,3954,0):ocfs2_dlm_eviction_cb:98 device (8,32): dlm has evicted node 1

Jun 21 06:32:02 Server1 kernel: (ocfs2rec,13935,1):ocfs2_replay_journal:1183 Recovering node 1 from slot 2 on device (8,48)

Jun 21 06:32:02 Server1 kernel: (ocfs2rec,13936,0):ocfs2_replay_journal:1183 Recovering node 1 from slot 2 on device (8,32)

Jun 21 06:32:03 Server1 kernel: kjournald starting.  Commit interval 5 seconds

Jun 21 06:32:03 Server1 kernel: kjournald starting.  Commit interval 5 seconds

Jun 21 06:32:03 Server1 kernel: (ocfs2_wq,4041,0):dlm_get_lock_resource:844 34D2BC9F3CC44D529BA4EE51DE5D845C:O00000000000000000e3cb800000000: at least one node (1) to recover before lock mastery can begin

Jun 21 06:32:04 Server1 kernel: (ocfs2_wq,4041,0):dlm_get_lock_resource:898 34D2BC9F3CC44D529BA4EE51DE5D845C:O00000000000000000e3cb800000000: at least one node (1) to recover before lock mastery can begin

Jun 21 06:32:32 Server1 kernel: o2net: connection to node Server2 (num 1) at 192.168.1.112:7777 has been idle for 30.0 seconds, shutting it down.

Jun 21 06:32:32 Server1 kernel: (swapper,0,3):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1371767522.144532 now 1371767552.162212 dr 1371767522.144517 adv 1371767522.144537:1371767522.144538 func (00000000:0) 0.0:0.0)

Jun 21 06:32:32 Server1 kernel: o2net: no longer connected to node Server2 (num 1) at 192.168.1.112:7777

Jun 21 06:33:02 Server1 kernel: (o2net,3954,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.

Jun 21 06:33:26 Server1 kernel: o2net: connected to node Server2 (num 1) at 192.168.1.112:7777

Jun 21 06:33:28 Server1 kernel: ocfs2_dlm: Node 1 joins domain E6B4B739466047059B5956A013B5DB49

Jun 21 06:33:28 Server1 kernel: ocfs2_dlm: Nodes in domain ("E6B4B739466047059B5956A013B5DB49"): 0 1 2

Jun 21 06:33:32 Server1 kernel: ocfs2_dlm: Node 1 joins domain 34D2BC9F3CC44D529BA4EE51DE5D845C

Jun 21 06:33:32 Server1 kernel: ocfs2_dlm: Nodes in domain ("34D2BC9F3CC44D529BA4EE51DE5D845C"): 0 1 2

SG1234
Enthusiast

Jun 21 06:32:32 Server1 kernel: o2net: connection to node Server2 (num 1) at 192.168.1.112:7777 has been idle for 30.0 seconds, shutting it down.

It looks like you have an OCFS2 filesystem, and your storage is sometimes slow to respond. As a workaround you can raise the O2CB heartbeat timeout in /etc/sysconfig/o2cb to a larger value, say the equivalent of 180 seconds.
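For reference, a minimal sketch of the relevant O2CB settings (assuming the stock o2cb init script on RHEL; the values below are illustrative, not a drop-in config -- check your own /etc/sysconfig/o2cb):

```shell
# /etc/sysconfig/o2cb -- illustrative values only
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2            # your cluster name may differ
# Disk heartbeat threshold, counted in 2-second iterations:
# dead time ~= (threshold - 1) * 2s, so 61 gives roughly 120 seconds.
O2CB_HEARTBEAT_THRESHOLD=61
# Network idle timeout in ms -- the "idle for 30.0 seconds" in the log
# above corresponds to the 30000 ms default.
O2CB_IDLE_TIMEOUT_MS=60000
```

Note that all nodes in the cluster must use the same values, and o2cb has to be restarted (with OCFS2 volumes remounted) for the change to take effect.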

HTH,

~Sai Garimella

hbuquis
Contributor

\[Env] ESX Server 4.1 U1, Guest OS: RHEL 5 64bit

Hi, yes, we have OCFS2 configured on that server, and its administrators will adjust the heartbeat timeout. But back to the vmware.log: is there anything unusual in it that contributes to the guest restarts? Thanks again

SG1234
Enthusiast

OCFS2 panics and reboots a node if the storage does not respond within the heartbeat time, so this is not a VMware problem. Increase the heartbeat threshold and, if possible, check with your SAN administrators about storage latency.
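To put a number on that: the disk-heartbeat dead time is derived from the threshold as roughly (threshold - 1) times the 2-second heartbeat interval, so the stock default of 31 self-fences a node after about 60 seconds of unresponsive storage. A quick sanity check of that arithmetic (default value assumed; read the live value on your cluster):

```shell
# O2CB dead time in seconds = (dead_threshold - 1) * 2s heartbeat interval.
# The live value can be read from
# /sys/kernel/config/cluster/<name>/heartbeat/dead_threshold once o2cb is up.
threshold=31                      # stock default, assumed here
echo $(( (threshold - 1) * 2 ))   # seconds before the node self-fences
```

So raising the threshold to 61 roughly doubles the tolerance to about 120 seconds, which is why bumping it helps when the SAN occasionally stalls.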

~Sai Garimella
