VMware Cloud Community
pmcdonnell
Enthusiast
Enthusiast

Intermittent vCenter/ESXi host disconnect issue

Configuration: vCenter Virtual Appliance running 5.1.0.  2 x identical ESXi hosts running 5.0U1.  VCVA is running on ESXi host 'a'.

Symptom: vCenter complains of a disconnect in communication with the host, but automatically reconnects after a moment.  ESXi hosts do not experience any issues - no reboots, no interruption to running VMs on them or connections to applications on the VMs.  Have not been able to find any seemingly relevant messages in the logs on either ESXi hosts.

Times of events:

Oct 30, 3:07PM - disconnect from host 'b', automatically reconnected after a moment

Nov 6, 10:35PM - disconnect from host 'a', automatically reconnected after a moment

Nov 20, 8:00AM - disconnect from host 'a', automatically reconnected after a moment

The key errors that I've located are in the vCenter vpxd.log files.  Same pattern of messages for each occurrence (didn't see these messagea at any other time).

Does anyone have any ideas as to what service is connecting to what, or where I might look for further information regarding this issue?  I've not come up with any useful KB articles or discussion posts regarding these messages.

Any ideas would be appreciated!

2012-10-30T19:07:09.172Z [7FEFA22BD700 info 'commonvpxLro' opID=474CFF98-00005D8C-e] [VpxLRO] -- BEGIN task-internal-44909 --  -- vmodl.query.PropertyCollector.retrieveContents -- 98223be0-04ed-2fc5-75e4-e246662b83c5(52f148c6-b51f-212a-9f3f-613de90a774a)
2012-10-30T19:07:09.174Z [7FEFA22BD700 info 'commonvpxLro' opID=474CFF98-00005D8C-e] [VpxLRO] -- FINISH task-internal-44909 --  -- vmodl.query.PropertyCollector.retrieveContents --
2012-10-30T19:07:11.877Z [7FEFA22FE700 error 'HttpConnectionPool-000000'] [ConnectComplete] Connect failed to <cs p:00007fef9c692d60, TCP:hostname.fqdn.com:443>; cnx: (null), error: N7Vmacore15SystemExceptionE(Name or service not known)
2012-10-30T19:07:11.877Z [7FEFA31F3700 warning 'VpxProfiler' opID=HB-host-444@28464-c0785c1b] ClientAdapterBase::InvokeOnSoap: (hostname.fqdn.com, vpxapi.VpxaService.retrieveChanges) [SoapRpcTime] took 20014 ms
2012-10-30T19:07:11.877Z [7FEFA31F3700 error 'vpxdvpxdVmomi' opID=HB-host-444@28464-c0785c1b] [VpxdClientAdapter] Got vmacore exception: Name or service not known
2012-10-30T19:07:11.878Z [7FEFA31F3700 error 'vpxdvpxdVmomi' opID=HB-host-444@28464-c0785c1b] [VpxdClientAdapter] Backtrace:
--> backtrace[00] rip 00007fefa8297a84 Vmacore::System::Stacktrace::CaptureWork(unsigned int)
--> backtrace[01] rip 00007fefa8176ae2 Vmacore::System::SystemFactoryImpl::CreateQuickBacktrace(Vmacore::Ref<Vmacore::System::Backtrace>&)
--> backtrace[02] rip 00007fefa80d7da5 Vmacore::Throwable::Throwable(std::string const&)
--> backtrace[03] rip 00007fefa825fd3c Vmacore::SystemException::SystemException(int, std::string const&)
--> backtrace[04] rip 00007fefa82846dd /usr/lib/vmware-vpx/libvmacore.so(+0x3e66dd) [0x7fefa82846dd]
--> backtrace[05] rip 00007fefa8166bec /usr/lib/vmware-vpx/libvmacore.so(+0x2c8bec) [0x7fefa8166bec]
--> backtrace[06] rip 00007fefa82956f3 /usr/lib/vmware-vpx/libvmacore.so(+0x3f76f3) [0x7fefa82956f3]
--> backtrace[07] rip 00007fefa828a5c2 /usr/lib/vmware-vpx/libvmacore.so(+0x3ec5c2) [0x7fefa828a5c2]
--> backtrace[08] rip 00007fefa828599f /usr/lib/vmware-vpx/libvmacore.so(+0x3e799f) [0x7fefa828599f]
--> backtrace[09] rip 00007fefa6bff7b6 /lib64/libpthread.so.0(+0x77b6) [0x7fefa6bff7b6]
--> backtrace[10] rip 00007fefa62c8c6d /lib64/libc.so.6(clone+0x6d) [0x7fefa62c8c6d]
-->
2012-10-30T19:07:13.336Z [7FEFA31F3700 info 'vpxdvpxdVmomi' opID=HB-host-444@28464-c0785c1b] [ClientAdapterBase::InvokeOnSoap] Invoke done (hostname.fqdn.com, vmodl.query.PropertyCollector.cancelWaitForUpdates)
2012-10-30T19:07:13.337Z [7FEFA31F3700 warning 'VpxProfiler' opID=HB-host-444@28464-c0785c1b] [VpxdHostSync] GetChanges host:hostname.fqdn.com (192.168.1.187) [GetChangesTime] took 21474 ms

Reply
0 Kudos
15 Replies
pmcdonnell
Enthusiast
Enthusiast

Bump - anyone have any ideas?  Anywhere I should be looking for further log messages or info?

Any assistance is appreciated.  Thanks!

Reply
0 Kudos
MKguy
Virtuoso
Virtuoso

The vpxa agent on the host might be dying and automatically restarting, had this a few times here and there, nothing too serious.

Can you post the /var/log/hostd.log, /var/log/vpxa.log and /var/log/vmkernel.log from an ESXi host shortly after disconnecting?

-- http://alpacapowered.wordpress.com
Reply
0 Kudos
sparrowangelste
Virtuoso
Virtuoso

2012-10-30T19:07:13.337Z [7FEFA31F3700 warning 'VpxProfiler'  opID=HB-host-444@28464-c0785c1b] [VpxdHostSync] GetChanges  host:hostname.fqdn.com (192.168.1.187) [GetChangesTime] took 21474 ms

that is 21 seconds.

cehck to see if you dont have networking issues or duplicate ips

--------------------- Sparrowangelstechnology : Vmware lover http://sparrowangelstechnology.blogspot.com
Reply
0 Kudos
pmcdonnell
Enthusiast
Enthusiast

The vpxa logs seem to cycle very quickly - there are no logs available from even the most recent event, however I had looked at the vpxa log content around the time of the most recent event and saw no indications of any issues, certainly nothing as severe as a service restart.

These are the only vmkernel log entries from the 20th:

2012-11-20T18:42:12.714Z cpu0:2375778)WARNING: UserObj: 675: Failed to crossdup fd 8, /dev/ptyp0 type CHAR: Busy
2012-11-20T18:42:12.714Z cpu0:2375778)WARNING: UserObj: 675: Failed to crossdup fd 10, /dev/ptyp0 type CHAR: Busy
2012-11-20T18:42:12.715Z cpu1:2375781)WARNING: UserLinux: 1340: unsupported: (void)
2012-11-20T19:42:23.316Z cpu1:2378197)WARNING: UserObj: 675: Failed to crossdup fd 8, /dev/ptyp0 type CHAR: Busy
2012-11-20T19:42:23.316Z cpu1:2378197)WARNING: UserObj: 675: Failed to crossdup fd 10, /dev/ptyp0 type CHAR: Busy
2012-11-20T19:42:23.317Z cpu3:2378200)WARNING: UserLinux: 1340: unsupported: (void)

The log entries are UTC, the event was 8AM EST which would not match up with any of the log entries listed above.

hostd.log entries from the time in question are below. Hopefully you can see something of concern that I haven't!

I've looked up most of the messages listed and didn't find any particular correlation with the disconnect errors.

Any advice or ideas is very much appreciated!  Thanks!

2012-11-20T12:58:35.035Z [68481B90 verbose 'SoapAdapter'] Responded to service state request
2012-11-20T12:58:36.901Z [68D47B90 verbose 'vm:/vmfs/volumes/ebaa2080-6e7ba086/esxcenter/esxcenter.vmx'] Actual VM overhead: 270991360 bytes
2012-11-20T12:58:36.901Z [68D47B90 verbose 'Vmsvc'] RefreshVms updated overhead for 1 VM
2012-11-20T12:58:36.903Z [68D06B90 verbose 'Default'] Power policy is unset
2012-11-20T12:58:40.043Z [68F4FB90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T12:58:40.043Z [68F4FB90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T12:58:50.478Z [68FD0B90 verbose 'Default'] Power policy is unset
2012-11-20T12:59:00.043Z [68BC1B90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T12:59:00.043Z [68BC1B90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T12:59:05.036Z [682DCB90 verbose 'SoapAdapter'] Responded to service state request
2012-11-20T12:59:07.981Z [684C6B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
2012-11-20T12:59:13.066Z [68CC5B90 verbose 'Default'] Power policy is unset
2012-11-20T12:59:17.784Z [68D47B90 verbose 'DvsManager'] PersistAllDvsInfo called
2012-11-20T12:59:20.041Z [68F4FB90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T12:59:20.042Z [68F4FB90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T12:59:20.477Z [68F0EB90 verbose 'Default'] Power policy is unset
2012-11-20T12:59:35.040Z [68D06B90 verbose 'SoapAdapter'] Responded to service state request
2012-11-20T12:59:36.907Z [6829BB90 verbose 'vm:/vmfs/volumes/ebaa2080-6e7ba086/XXXXXX_Appliance_Server/XXXXXX_Appliance_Server.vmx'] Actual VM overhead: 142921728 bytes
2012-11-20T12:59:36.907Z [6829BB90 verbose 'Vmsvc'] RefreshVms updated overhead for 1 VM
2012-11-20T12:59:36.909Z [67C40B90 verbose 'Default'] Power policy is unset
2012-11-20T12:59:40.043Z [68FD0B90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T12:59:40.043Z [68FD0B90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T12:59:43.065Z [68CC5B90 verbose 'Default'] Power policy is unset
2012-11-20T12:59:50.476Z [68ECDB90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:00.044Z [681DAB90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T13:00:00.044Z [681DAB90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T13:00:05.041Z [68E09B90 verbose 'SoapAdapter'] Responded to service state request
2012-11-20T13:00:13.066Z [68ECDB90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:14.046Z [67C40B90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:14.635Z [68C84B90 verbose 'Proxysvc Req45582'] New proxy client SSL(TCP(local=192.168.1.188:443, peer=192.168.1.185:45876))
2012-11-20T13:00:14.675Z [68C43B90 verbose 'Locale' opID=SWI-f059dce6-8f] Default resource used for 'counter.virtualDisk.commandsAborted.label' expected in module 'perf'.
2012-11-20T13:00:14.675Z [68C43B90 verbose 'Locale' opID=SWI-f059dce6-8f] Default resource used for 'counter.virtualDisk.commandsAborted.summary' expected in module 'perf'.
2012-11-20T13:00:14.675Z [68C43B90 verbose 'Locale' opID=SWI-f059dce6-8f] Default resource used for 'counter.virtualDisk.busResets.label' expected in module 'perf'.
2012-11-20T13:00:14.675Z [68C43B90 verbose 'Locale' opID=SWI-f059dce6-8f] Default resource used for 'counter.virtualDisk.busResets.summary' expected in module 'perf'.
2012-11-20T13:00:16.790Z [684C6B90 verbose 'Proxysvc Req45583'] New proxy client SSL(TCP(local=192.168.1.188:443, peer=192.168.1.185:45877))
2012-11-20T13:00:20.044Z [6829BB90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T13:00:20.044Z [6829BB90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T13:00:20.476Z [68FD0B90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:23.754Z [68BC1B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-ed] Default resource used for 'counter.virtualDisk.commandsAborted.label' expected in module 'perf'.
2012-11-20T13:00:23.754Z [68BC1B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-ed] Default resource used for 'counter.virtualDisk.commandsAborted.summary' expected in module 'perf'.
2012-11-20T13:00:23.754Z [68BC1B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-ed] Default resource used for 'counter.virtualDisk.busResets.label' expected in module 'perf'.
2012-11-20T13:00:23.754Z [68BC1B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-ed] Default resource used for 'counter.virtualDisk.busResets.summary' expected in module 'perf'.
2012-11-20T13:00:23.780Z [68B80B90 verbose 'ha-license-manager' opID=HB-host-422@14206-c05314b9-ed] Load: Loading existing file: /etc/vmware/license.cfg
2012-11-20T13:00:23.799Z [68B80B90 verbose 'Default' opID=HB-host-422@14206-c05314b9-ed] ha-license-manager:Validate -> Valid license detected for "VMware ESX Server 5.0" (lastError=0, desc.IsValid:Yes)
2012-11-20T13:00:23.808Z [68B80B90 error 'Default' opID=HB-host-422@14206-c05314b9-ed] Unable to parse MaxRam value:
2012-11-20T13:00:23.808Z [68B80B90 error 'Default' opID=HB-host-422@14206-c05314b9-ed] Unable to parse MaxRamPerCpu value:
2012-11-20T13:00:23.808Z [68B80B90 error 'Default' opID=HB-host-422@14206-c05314b9-ed] Unable to parse MinRamPerCpu value:
2012-11-20T13:00:23.935Z [67C40B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-a4] Default resource used for 'counter.virtualDisk.commandsAborted.label' expected in module 'perf'.
2012-11-20T13:00:23.935Z [67C40B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-a4] Default resource used for 'counter.virtualDisk.commandsAborted.summary' expected in module 'perf'.
2012-11-20T13:00:23.935Z [67C40B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-a4] Default resource used for 'counter.virtualDisk.busResets.label' expected in module 'perf'.
2012-11-20T13:00:23.935Z [67C40B90 verbose 'Locale' opID=HB-host-422@14206-c05314b9-a4] Default resource used for 'counter.virtualDisk.busResets.summary' expected in module 'perf'.
2012-11-20T13:00:25.080Z [68FD0B90 info 'TaskManager' opID=HB-host-422@14206-c05314b9-SWI-b5f5412a-bc] Task Created : haTask--vim.dvs.HostDistributedVirtualSwitchManager.applyDvsList-238054744
2012-11-20T13:00:25.081Z [68D88B90 info 'TaskManager' opID=HB-host-422@14206-c05314b9-SWI-b5f5412a-bc] Task Completed : haTask--vim.dvs.HostDistributedVirtualSwitchManager.applyDvsList-238054744 Status success
2012-11-20T13:00:33.756Z [68D47B90 verbose 'Proxysvc Req45584'] New proxy client SSL(TCP(local=192.168.1.188:443, peer=192.168.1.185:45895))
2012-11-20T13:00:33.794Z [68FD0B90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bf] Default resource used for 'counter.virtualDisk.commandsAborted.label' expected in module 'perf'.
2012-11-20T13:00:33.794Z [68FD0B90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bf] Default resource used for 'counter.virtualDisk.commandsAborted.summary' expected in module 'perf'.
2012-11-20T13:00:33.794Z [68FD0B90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bf] Default resource used for 'counter.virtualDisk.busResets.label' expected in module 'perf'.
2012-11-20T13:00:33.794Z [68FD0B90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bf] Default resource used for 'counter.virtualDisk.busResets.summary' expected in module 'perf'.
2012-11-20T13:00:33.821Z [68BC1B90 verbose 'ha-license-manager' opID=HB-host-422@14207-d08d0fdb-bf] Load: Loading existing file: /etc/vmware/license.cfg
2012-11-20T13:00:33.840Z [68BC1B90 verbose 'Default' opID=HB-host-422@14207-d08d0fdb-bf] ha-license-manager:Validate -> Valid license detected for "VMware ESX Server 5.0" (lastError=0, desc.IsValid:Yes)
2012-11-20T13:00:33.849Z [68BC1B90 error 'Default' opID=HB-host-422@14207-d08d0fdb-bf] Unable to parse MaxRam value:
2012-11-20T13:00:33.849Z [68BC1B90 error 'Default' opID=HB-host-422@14207-d08d0fdb-bf] Unable to parse MaxRamPerCpu value:
2012-11-20T13:00:33.849Z [68BC1B90 error 'Default' opID=HB-host-422@14207-d08d0fdb-bf] Unable to parse MinRamPerCpu value:
2012-11-20T13:00:33.930Z [68F0EB90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bc] Default resource used for 'counter.virtualDisk.commandsAborted.label' expected in module 'perf'.
2012-11-20T13:00:33.930Z [68F0EB90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bc] Default resource used for 'counter.virtualDisk.commandsAborted.summary' expected in module 'perf'.
2012-11-20T13:00:33.930Z [68F0EB90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bc] Default resource used for 'counter.virtualDisk.busResets.label' expected in module 'perf'.
2012-11-20T13:00:33.930Z [68F0EB90 verbose 'Locale' opID=HB-host-422@14207-d08d0fdb-bc] Default resource used for 'counter.virtualDisk.busResets.summary' expected in module 'perf'.
2012-11-20T13:00:35.044Z [67F03B90 verbose 'SoapAdapter'] Responded to service state request
2012-11-20T13:00:36.911Z [68BC1B90 verbose 'vm:/vmfs/volumes/ebaa2080-6e7ba086/Symposium - Discussion Server/Symposium - Discussion Server.vmx'] Actual VM overhead: 55209984 bytes
2012-11-20T13:00:36.913Z [68C02B90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:36.914Z [68BC1B90 verbose 'Vmsvc'] RefreshVms updated overhead for 1 VM
2012-11-20T13:00:39.431Z [68481B90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:40.044Z [67F03B90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T13:00:40.044Z [67F03B90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T13:00:43.065Z [67F03B90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:43.754Z [682DCB90 verbose 'Locale' opID=HB-host-422@14210-9a9742c1-48] Default resource used for 'counter.virtualDisk.commandsAborted.label' expected in module 'perf'.
2012-11-20T13:00:43.754Z [682DCB90 verbose 'Locale' opID=HB-host-422@14210-9a9742c1-48] Default resource used for 'counter.virtualDisk.commandsAborted.summary' expected in module 'perf'.
2012-11-20T13:00:43.754Z [682DCB90 verbose 'Locale' opID=HB-host-422@14210-9a9742c1-48] Default resource used for 'counter.virtualDisk.busResets.label' expected in module 'perf'.
2012-11-20T13:00:43.754Z [682DCB90 verbose 'Locale' opID=HB-host-422@14210-9a9742c1-48] Default resource used for 'counter.virtualDisk.busResets.summary' expected in module 'perf'.
2012-11-20T13:00:43.755Z [68BC1B90 verbose 'Proxysvc Req45585'] New proxy client SSL(TCP(local=192.168.1.188:443, peer=192.168.1.185:45896))
2012-11-20T13:00:43.782Z [681DAB90 verbose 'ha-license-manager' opID=HB-host-422@14210-9a9742c1-48] Load: Loading existing file: /etc/vmware/license.cfg
2012-11-20T13:00:43.801Z [681DAB90 verbose 'Default' opID=HB-host-422@14210-9a9742c1-48] ha-license-manager:Validate -> Valid license detected for "VMware ESX Server 5.0" (lastError=0, desc.IsValid:Yes)
2012-11-20T13:00:43.810Z [681DAB90 error 'Default' opID=HB-host-422@14210-9a9742c1-48] Unable to parse MaxRam value:
2012-11-20T13:00:43.810Z [681DAB90 error 'Default' opID=HB-host-422@14210-9a9742c1-48] Unable to parse MaxRamPerCpu value:
2012-11-20T13:00:43.810Z [681DAB90 error 'Default' opID=HB-host-422@14210-9a9742c1-48] Unable to parse MinRamPerCpu value:
2012-11-20T13:00:44.046Z [68D88B90 verbose 'Default'] Power policy is unset
2012-11-20T13:00:45.022Z [67746CC0 verbose 'Default'] Power policy is unset
2012-11-20T13:00:50.475Z [68E4BB90 verbose 'Default'] Power policy is unset
2012-11-20T13:01:00.042Z [68BC1B90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T13:01:00.042Z [68BC1B90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T13:01:05.045Z [68BC1B90 verbose 'SoapAdapter'] Responded to service state request
2012-11-20T13:01:20.044Z [681DAB90 verbose 'Statssvc'] HostCtl Exception in stats collection: Sysinfo error on operation returned status : Not found. Please see the VMkernel log for detailed error information
2012-11-20T13:01:20.044Z [681DAB90 verbose 'Statssvc'] HostCtl Exception in stats collection.  Turn on 'trivia' log for details
2012-11-20T13:01:20.474Z [68FD0B90 verbose 'Default'] Power policy is unset
2012-11-20T13:01:33.203Z [68F0EB90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
2012-11-20T13:01:35.046Z [68E09B90 verbose 'SoapAdapter'] Responded to service state request

Reply
0 Kudos
pmcdonnell
Enthusiast
Enthusiast

The first occurrence was with an ESXi host that the VCVA was not running on, so there was data transfer through a network connection, however I was not able to find any evidence of a network disconnect or interrupt whatsoever.

The next two occurrences were the VCVA detecting a disconnect of the host on which the VCVA itself was running!  To me that was even more unusual and perplexing.

I have not been able to detect any IP conflicts.  It is a pretty small network and we keep good track of IP allocation, so I don't believe it is a conflict.

Please let me know if you have any other ideas - any tips or info is very much appreciated!

Thanks!

Reply
0 Kudos
MKguy
Virtuoso
Virtuoso

I can't spot anything wrong with the log sections you provided. Reviewing the vpxa.log would still be interesting though. If the vpxa.log cycles too quickly, grab the gzipped older logs from /scratch/log/vpxa*.

Just in case, can you run a continuous ping between VCSA and a host to make fully sure that network connectivity even on the same host is not a problem?

-- http://alpacapowered.wordpress.com
pmcdonnell
Enthusiast
Enthusiast

I'll start timestampped pings from the VCVA to both ESXi hosts (and maybe a couple of other systems) and will check daily for any anomolies.

I have looked through all the gzipped vpxa logs in /scratch/log and they are all from today - they seem to be rotating about once per hour or two.  The majority of the messages (about 2/3 of each log) appear to be similar to the following:

2012-11-23T15:32:07.179Z [67C35B90 verbose 'Default' opID=HB-host-422@18757-5e4cc9cb-50] [VpxaStatsMetadata] Mapping VpxdCounter 121 to hostd counter -2147483593
2012-11-23T15:32:07.179Z [67C35B90 verbose 'Default' opID=HB-host-422@18757-5e4cc9cb-50] [VpxaStatsMetadata] Mapping VpxdCounter 108 to hostd counter -2147483592
2012-11-23T15:32:07.179Z [67C35B90 verbose 'Default' opID=HB-host-422@18757-5e4cc9cb-50] [VpxaStatsMetadata] Mapping VpxdCounter 110 to hostd counter -2147483591

I'll report back on Monday with the results over the weekend of the ping test.

Thanks again!

Reply
0 Kudos
pmcdonnell
Enthusiast
Enthusiast

It happened again!  Last evening at 22:52:30 We recevied two alarm messages:

Target: hostname.example.com
Stateless event alarm

Alarm Definition:
([Event alarm expression: Cannot connect host - network error] OR [Event alarm expression: Cannot connect host - time-out] OR [Event alarm expression: Cannot connect host - time-out] OR [Event alarm expression: Host connection lost])

Event details:
Host hostname.example.com in mygroup is not responding

Same issue in the vpxd.log:

2012-11-27T03:52:27.895Z [7FEFA3453700 info 'vpxdvpxdVmomi' opID=SWI-4e6b857] [ClientAdapterBase::InvokeOnSoap] Invoke done (hostname.example.com, vpxapi.VpxaService.fetchQuickStats)
2012-11-27T03:52:27.895Z [7FEFA3453700 info 'vpxdvpxdVmomi' opID=SWI-4e6b857] [ClientAdapterBase::InvokeOnSoap] Invoke done (hostname.example.com, vpxapi.VpxaService.fetchQuickStats)
2012-11-27T03:52:29.378Z [7FEFA3275700 error 'HttpConnectionPool-000000'] [ConnectComplete] Connect failed to <cs p:00007fef9c692d60, TCP:hostname.example.com:443>; cnx: (null), error: N7Vmacore15SystemExceptionE(Name or service not known)
2012-11-27T03:52:29.379Z [7FEFA227C700 error 'vpxdvpxdVmomi'] [VpxdClientAdapter] Got vmacore exception: Name or service not known
2012-11-27T03:52:29.380Z [7FEFA227C700 error 'vpxdvpxdVmomi'] [VpxdClientAdapter] Backtrace:
--> backtrace[00] rip 00007fefa8297a84 Vmacore::System::Stacktrace::CaptureWork(unsigned int)
--> backtrace[01] rip 00007fefa8176ae2 Vmacore::System::SystemFactoryImpl::CreateQuickBacktrace(Vmacore::Ref<Vmacore::System::Backtrace>&)
--> backtrace[02] rip 00007fefa80d7da5 Vmacore::Throwable::Throwable(std::string const&)
--> backtrace[03] rip 00007fefa825fd3c Vmacore::SystemException::SystemException(int, std::string const&)
--> backtrace[04] rip 00007fefa82846dd /usr/lib/vmware-vpx/libvmacore.so(+0x3e66dd) [0x7fefa82846dd]
--> backtrace[05] rip 00007fefa8166bec /usr/lib/vmware-vpx/libvmacore.so(+0x2c8bec) [0x7fefa8166bec]
--> backtrace[06] rip 00007fefa82956f3 /usr/lib/vmware-vpx/libvmacore.so(+0x3f76f3) [0x7fefa82956f3]
--> backtrace[07] rip 00007fefa828a5c2 /usr/lib/vmware-vpx/libvmacore.so(+0x3ec5c2) [0x7fefa828a5c2]
--> backtrace[08] rip 00007fefa828599f /usr/lib/vmware-vpx/libvmacore.so(+0x3e799f) [0x7fefa828599f]
--> backtrace[09] rip 00007fefa6bff7b6 /lib64/libpthread.so.0(+0x77b6) [0x7fefa6bff7b6]
--> backtrace[10] rip 00007fefa62c8c6d /lib64/libc.so.6(clone+0x6d) [0x7fefa62c8c6d]
-->
2012-11-27T03:52:29.389Z [7FEFA3494700 info 'vpxdvpxdVmomi' opID=HB-host-444@53107-ce300c74] [ClientAdapterBase::InvokeOnSoap] Invoke done (hostname.example.com, vmodl.query.PropertyCollector.Filter.destroy)
2012-11-27T03:52:29.389Z [7FEFA3494700 warning 'VpxProfiler' opID=HB-host-444@53107-ce300c74] [VpxdHostSync] GetChanges host:hostname.example.com (192.168.1.187) [GetChangesTime] took 20023 ms
2012-11-27T03:52:29.389Z [7FEFA3494700 warning 'VpxProfiler' opID=HB-host-444@53107-ce300c74] [VpxdHostSync] DoHostSync:0x7fefb41e0900 [DoHostSyncTime] took 20024 ms

The timestamped ping logs I had running from my vCenter server to both ESXi hosts as well as a few other physical servers were perfectly normal - no timeouts, delays or unusual responses whatsoever during this time.

I was able to gather the vpxa.log and hostd.log info from the ESXi host that vCenter claims it couldn't contact.  I saved the entire log and the events for the minute that this issue occurred are in the attached file.  I've only replaced the hostnames and VM names where needed.

If I have reviewed the logs correctly, it looks like the vpxa service may have restarted, but I think there were still successful 'Soap' requests and responses?

Is it because the response was delayed by too long?  If so, are there any typical causes for this?

I also note that a 'DoHostSyncTime' took over 20s in the last message above - Is this an internal VMware time sync routine?  Might this be relevant at all?

Thanks again for any input!!!

Reply
0 Kudos
MKguy
Virtuoso
Virtuoso

2012-11-27T03:52:31.522Z [6277DB90 verbose 'Default' opID=HB-host-444@53107-dc80bb14-9a] [VpxaMoService::GetChangesInt] Vpxa restarted or stolen by other server. Start a full sync

It indeed does look to have been restarted, but I'd expect some more "distinct" messages or can only guess as to what happened. It's probably time for a SR.

This link suggests increasing the vpxd heartbeat timeout value:

http://h10025.www1.hp.com/ewfrf/wc/document?cc=us&lc=en&dlc=en&docname=c03241064

Some with a similar issue suggested disabling the ESXi host firewall:

http://communities.vmware.com/thread/389935?start=15&tstart=0

http://nutanix.blogspot.de/2012/06/vcenter-disconnects-due-to-apd-or.html

-- http://alpacapowered.wordpress.com
pmcdonnell
Enthusiast
Enthusiast

MKguy, thank you very much for your reply!

Based on the HP article about the heartbeat timeout change, I did a bit of further reading.  The VMware KB article(1005757) referenced in the HP document describes a 60 second timeout, however I found another web page (admittedly much older) that describes a 20 second timeout (http://theregime.wordpress.com/2010/02/03/increase-esx-heartbeat-to-vcenter/).  I think a 20 second timeout fits with the behaviour we're seeing - two of the delays this time (GetChangesTime and DoHostSyncTime) took just 23/24ms respectively longer than 20 seconds.  Perhaps that led to the 'disconnected' condition.

I trying to think of what could cause anything to take 20 seconds between these systems, a common DNS server came to light.  This system is running on old hardware and is short on resources, including RAM.  I have a record of it taking up to 18s to respond to a ping on occasion.  The theory I'm thinking of is that if a DNS query took 15 or 18+ seconds, that could contribute to a delay that might take a response over the 20s limit (if it is 20s).

The initial action we're going to take is to upgrade the RAM in the DNS server to give it more resources to work with to hopefully reduce the occasions where the response times are very high.  Failing that we will most likely move on with changing the timeout.  I'll look for any patterns of how long those checks have taken based on the vpxd.log before and after we make the RAM upgrade to see if it has made a noticable difference.

Thanks again!  Will report back/assign answered marks when we know if this is resolved!

Reply
0 Kudos
RParker
Immortal
Immortal

I have seen this before..everytime it's an IP conflict.  Some other server is trying to use the same IP as your host.

Reply
0 Kudos
RParker
Immortal
Immortal

pmcdonnell wrote:

I trying to think of what could cause anything to take 20 seconds between these systems, a common DNS server came to light.  This system is running on old hardware and is short on resources, including RAM.  I have a record of it taking up to 18s to respond to a ping on occasion.  The theory I'm thinking of is that if a DNS query took 15 or 18+ seconds, that could contribute to a delay that might take a response over the 20s limit (if it is 20s).

Your assessment of DNS operation is flawed.  It doesn't matter how slow or old a machine is, DNS doesn't need to refresh for it to work.  If what you are describing is true *ALL* servers would suffer the same problem, not just one or two records.  Therefore DNS isn't the issue.

Once a record is in DNS, it's there period. It doesn't delete all records, then refresh and update the table again, they are there until you delete them.  DNS is also a very simple server, ALL it does is respond to request.  CPU is very low.  You can run DNS on an old 386 machine with 64 meg of RAM (provided Windows can run that small) with no issue.  DNS needs almost zero resources to function. 

I believe you have a network or IP conflict issue.

Reply
0 Kudos
pmcdonnell
Enthusiast
Enthusiast

While performing maintenance on the DNS server, we recevied disconnect messages.  There have been two other incidents where  it has had very slow response times and again triggered disconnect alarms.

The DNS server is scheduled for replacement, but in the interm I have implemented hosts file entries for the vCenter/ESXi systems.  I performed testing and even when the DNS server was inaccessible there was no disconnect error.  I believe this is a reasonable workaround for the disconnect issue until our DNS server replacment is completed.

I've looked for duplicate IP addresses by reviewing the config of all systems on the network and reviewing arp info - no duplicates were detected.

I will post again after a week in this configuration if we can confirm that the issuer has not come up again.

Thanks to all for the input and ideas!

Reply
0 Kudos
pmcdonnell
Enthusiast
Enthusiast

There have been no further issues so I will consider this issue resolve.

Thanks for the ideas everyone!

Reply
0 Kudos
craigclem10
Contributor
Contributor

Never forget Windows firewall either. I had a vCenter server drop a host, when another was added. Turned out it was Windows Firewall (enforced by GPO), and the vCenter server had been placed in a wrong OU (client OU) which limited the TCP connections.

Reply
0 Kudos