virtualtacit
Contributor

OS heartbeat timeouts during test and actual recovery

Does anyone know how to prevent the OS heartbeat timeouts from occurring within a recovery plan? Presumably this is communication between VMware Tools and VC. If it always fails, which it does in my case, then I should set my timeout values small enough to reduce my test and recovery time, since about half the time seems to be spent waiting for this step to time out. And if that is right and you are operating within a test network bubble, why not make the OS heartbeat wait a recovery-only step? Again, I just need clarification on what this step is actually doing. Thanks in advance.


Accepted Solutions
Smoggy
VMware Employee

I guess the first thing to say is that these timeouts are basically just warnings, not errors, so in nearly all cases they can be ignored. But let's face it, they don't look pretty, especially as red errors stand out :)

I suspect you have just updated to ESX/VC U3? If so, I think a change was made that adjusts the frequency we use to check for VMware Tools heartbeats. With the new value, SRM recovery plans miss the first "check" and have to wait for a second chance, by which time the timeout warning has already been logged.

If the errors really are annoying then you can, if you wish (insert disclaimer here: this is just my suggestion and not a VMware suggestion), simply edit the hostd config.xml (/etc/vmware/hostd/config.xml) on your recovery site ESX hosts so that the vmsvc section looks like this:

<vmsvc>
  <enabled>true</enabled>
  <heartbeatDelayInSecs>40</heartbeatDelayInSecs>
</vmsvc>

The value of 40 is just my choice; you can use whatever you want. I think the previous default was 20 seconds, but that has now been changed by the ESX U3 code. I've got a feeling the ESX folks will be issuing a KB article on this. Once that change is made, just run the following command on the service console (or restart ESX, whichever is easier :) ):

    service mgmt-vmware restart

Then run the recovery plan again; there is no need to restart VC / SRM. When I did this, the plan ran through to completion without hitting any VMTools timeout errors/warnings. Note: in your recovery plan you should be able to leave the tools timeout values at their defaults (or old defaults) of 30 seconds / 300 seconds.
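
If you have several recovery site hosts to change, the edit itself could be scripted rather than done by hand. Below is a minimal Python sketch of that idea, not a VMware-supplied tool. The function name set_heartbeat_delay is made up for illustration, and the <config> wrapper in the sample is an assumption about the surrounding file, so check it against your own config.xml. Note that ElementTree drops XML comments when it re-serialises, so try this on a copy of the file and keep a backup.

```python
import xml.etree.ElementTree as ET


def set_heartbeat_delay(xml_text: str, seconds: int) -> str:
    """Return xml_text with <heartbeatDelayInSecs> set inside <vmsvc>.

    Hypothetical helper: it only rewrites the XML text. Restarting the
    management service afterwards is still a separate manual step.
    """
    if seconds <= 0:
        raise ValueError("seconds must be a positive integer")

    root = ET.fromstring(xml_text)

    # iter() also yields the root itself, so this works whether <vmsvc>
    # is the top-level element or nested somewhere below it.
    vmsvc = next(root.iter("vmsvc"), None)
    if vmsvc is None:
        raise ValueError("no <vmsvc> section found")

    delay = vmsvc.find("heartbeatDelayInSecs")
    if delay is None:
        # The element is absent until someone adds it, so create it.
        delay = ET.SubElement(vmsvc, "heartbeatDelayInSecs")
    delay.text = str(seconds)

    return ET.tostring(root, encoding="unicode")


# Assumed sample: a stripped-down stand-in for hostd's config.xml.
SAMPLE = "<config><vmsvc><enabled>true</enabled></vmsvc></config>"
print(set_heartbeat_delay(SAMPLE, 40))
```

Running it twice with different values updates the existing element instead of adding a second one, so it is safe to re-apply.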

best regards,

Lee Dilworth

4 Replies
virtualtacit
Contributor

Thanks Lee, I will give it a try today.

virtualtacit
Contributor

Thanks Lee, this worked like a charm, and thanks for starting the "Uptime" blog, good stuff.

justin_emerson
Enthusiast

Lee,

Thank you, this fixed my problem as well. Is this a problem in SRM or in ESX Update 3? Either way, will one of these two products be changed soon to fix this? This seems like kind of a big deal.
