VMware Cloud Community
hpuxman
Contributor

Wait for OS Heartbeat with SRM 4.0 - lengthy recovery test

Hi everyone:

I've noticed that when performing test failovers, the Waiting for OS Heartbeat step takes a long time, about 7-9 minutes. If I open a console and actually log in to the VM when it comes up during the test, the Waiting for OS Heartbeat time drops to about half.

I searched the KBs and found an article about SRM 1.0 that involves changing a parameter on ESX 3.5, but that parameter does not exist on vSphere 4.0U1 hosts.

I've also tried setting the timeout to 0, but then I get a timeout error.

Any way to 'tune' the Waiting for OS Heartbeat parameter?

6 Replies
tobiashansen
Contributor

You can modify the Wait for OS Heartbeat value in the specific recovery plan.

Right-click your plan.

Click Next twice to get to "Response times".

On that page you have the "Wait for OS Heartbeat" value, which is set to 600 seconds by default.

hegars
Contributor

We have the same issue here (vSphere v4.0) while waiting for the OS Heartbeat, and tried halving the timeout setting to 300 seconds.

Changing the Timeout setting alone does not solve the problem, but if we log onto the recovered VM while the Recovery Plan is running, then the required services start and the Recovery Plan completes "all green" as desired.

So a combination of shortening the Timeout setting and logging onto the recovered VM before the Recovery Plan finishes, "solves" our problem.

Steve

pauljawood
Enthusiast

Hi,

I have seen this before, but can you please confirm that VMware Tools is up to date on the VMs? You could also try setting the OS heartbeat wait to 0 if you have faith that the machine will be running. This can also be done for the network change.

-


If you found this helpful then please leave some points.

hegars
Contributor

Thanks.

VMware Tools is up to date on all the VMs.

I think the problem is related to using the Test network for the Test Recovery. The VM cannot log into the domain because the DC is on another network. If we open the console on the recovered VM and log into the VM locally while the recovery process is running, then it passes fine.

This timeout does not occur during a real Recovery, as the recovered VM has a DC available to it.
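One way to check the "no DC on the test network" theory is to test from inside the recovered VM whether a domain controller answers at all. The following is only a minimal sketch, assuming Python is available in the guest; the host name `dc01.example.local` is a placeholder, and port 389 is the standard LDAP port rather than anything confirmed in this thread.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Hypothetical usage; replace the placeholder with a real domain controller:
# print(can_reach("dc01.example.local", 389))
```

If this returns False on the test network but True on the production network, that would be consistent with the explanation above, though it does not prove it.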

Steve

pauljawood
Enthusiast

Hi,

Are you using the 'Auto' network or one that you have set up? I have not seen the issue you describe when using the test bubble switch.

-


If you found this helpful then please leave some points.

rnourse
Enthusiast

The issue can be resolved with the fix documented in KB article 1008059. Yes, I know the article suggests this problem is unique to ESX 3.x and SRM 1, but we have seen the same issues under vSphere and SRM 4. Edit the /etc/vmware/hostd/config.xml file and, if the following stanza does not exist, create it (we had to).

<vmsvc>
  <heartbeatDelayInSecs>0</heartbeatDelayInSecs>
  <enabled>true</enabled>
</vmsvc>

Restart the management agents and test again; there should be no more timeout issues. Note that the KB article has a typo: it says to set the delay to zero, but then the example sets it to 40. Do not set it to 40 or you'll still have the same issue.
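For anyone applying this to several hosts, the "create the stanza if it does not exist" step could be scripted. The following is a rough sketch only, not a tested procedure: the function name `ensure_vmsvc` and the `parent_path` argument are made up for illustration, and this thread does not say where in config.xml the stanza belongs, so check the KB article for placement. Note that `xml.etree.ElementTree` drops XML comments when it parses, so run it on a copy and compare before replacing the real file.

```python
import xml.etree.ElementTree as ET


def ensure_vmsvc(xml_text: str, parent_path: str = ".") -> str:
    """Return xml_text with a <vmsvc> stanza whose heartbeat delay is 0.

    parent_path is where a missing stanza gets created; its correct value
    for hostd's config.xml is an assumption the caller must verify.
    """
    root = ET.fromstring(xml_text)
    # Reuse an existing <vmsvc> element anywhere in the tree, if present.
    vmsvc = next(root.iter("vmsvc"), None)
    if vmsvc is None:
        parent = root if parent_path == "." else root.find(parent_path)
        if parent is None:
            raise ValueError(f"parent element not found: {parent_path}")
        vmsvc = ET.SubElement(parent, "vmsvc")
    # Set the two values from the stanza above, creating them if missing.
    for tag, value in (("heartbeatDelayInSecs", "0"), ("enabled", "true")):
        child = vmsvc.find(tag)
        if child is None:
            child = ET.SubElement(vmsvc, tag)
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical usage with a toy document (not the real hostd layout):
# print(ensure_vmsvc("<config><plugins/></config>", "plugins"))
```

The sketch also resets an existing delay (for example the 40 from the KB example) to 0. The management agents still need to be restarted afterwards, as described above.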
