How does the HA (High Availability) feature work?
VMware HA continuously monitors all ESX Server hosts in a cluster and detects failures. An agent on each host maintains a "heartbeat" with the other hosts in the cluster, and loss of a heartbeat initiates the process of restarting all affected virtual machines on other hosts.

You create and manage clusters using VirtualCenter. The VirtualCenter Management Server places an agent on each host in the cluster so that each host can communicate with the other hosts, maintain state information, and know what to do if another host fails. The VirtualCenter Management Server is not a single point of failure. If the VirtualCenter Management Server host goes down, HA clusters can still restart virtual machines on other hosts in case of failure; however, the information about what extra resources are available is based on the state of the cluster before the VirtualCenter Management Server went down.

HA monitors at all times whether sufficient resources are available in the cluster to restart virtual machines on different physical hosts in the event of host failure. Safe restart of virtual machines is made possible by the locking technology in the ESX Server storage stack, which allows multiple ESX Server hosts to access the same virtual machine files simultaneously.

Host failure detection occurs 15 seconds after the HA service on a host has stopped sending heartbeats to the other hosts in the cluster. A host stops sending heartbeats if it is isolated from the network. At that time, other hosts in the cluster treat this host as failed, while this host declares itself isolated from the network. By default, the isolated host powers off its virtual machines. These virtual machines can then successfully fail over to other hosts in the cluster. If the virtual machines are not powered off and the isolated host still has SAN access, it retains the disk lock on the virtual machine files, and attempts to fail over the virtual machines to another host fail. The virtual machines continue to run on the isolated host. VMFS disk locking prevents simultaneous write operations to the virtual machine disk files and potential corruption.

If the network connection is restored before 12 seconds have elapsed, other hosts in the cluster do not treat this as a host failure. In addition, the host with the transient network connection problem does not declare itself isolated from the network and continues running. In the window between 12 and 14 seconds, the clustering service on the isolated host declares itself isolated and starts powering off virtual machines that use the default isolation response setting. If the network connection is restored during that window, the virtual machines that were powered off are not restarted on other hosts, because the HA services on the other hosts do not yet consider this host failed. As a result, if the network connection is restored between 12 and 14 seconds after the host lost connectivity, the virtual machines are powered off but not failed over.
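The timing described above can be summarized in a small sketch. This is only an illustration of the thresholds quoted in the text, not VMware's implementation: the function and enum names are made up for this example, it assumes the default isolation response (power off), and it treats the whole interval from 12 seconds up to the 15-second failure detection as the "powered off but not failed over" window, since the text does not say what happens between 14 and 15 seconds.

```python
from enum import Enum

# Thresholds quoted in the text above. Treating everything from 12 s up to
# 15 s as one window is an assumption made for this sketch.
ISOLATION_DECLARED_AT_S = 12
HOST_FAILURE_DETECTED_AT_S = 15


class Outcome(Enum):
    NO_ACTION = "host keeps running, nothing is powered off"
    POWERED_OFF_NOT_FAILED_OVER = "VMs powered off but not restarted elsewhere"
    FAILED_OVER = "host treated as failed, VMs restarted on other hosts"


def classify_outage(outage_seconds: float) -> Outcome:
    """Classify a network outage of the given length (default isolation response)."""
    if outage_seconds < 0:
        raise ValueError("outage_seconds must be non-negative")
    if outage_seconds < ISOLATION_DECLARED_AT_S:
        return Outcome.NO_ACTION
    if outage_seconds < HOST_FAILURE_DETECTED_AT_S:
        return Outcome.POWERED_OFF_NOT_FAILED_OVER
    return Outcome.FAILED_OVER


if __name__ == "__main__":
    for seconds in (5, 13, 20):
        print(seconds, "s:", classify_outage(seconds).value)
```

The middle case is the one to watch for: a short outage can leave virtual machines powered off with nothing restarting them.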

http://vmware-land.com/Vmware_Tips.html#VC8


Troubleshooting HA

  • Check IP connectivity.
  • Check DNS resolution.
  • Ensure storage and networks are visible throughout the cluster.
  • No user should manage the hosts by bypassing VC and tweaking
    resource reservations. This causes the state to go to red.
  • Check logs:
    /opt/LGTOaam512/log/*
    /opt/LGTOaam512/vmsupport/*
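The DNS resolution check can be scripted. Below is a minimal sketch using Python's standard socket module; the host names are placeholders, the function names are made up for this example, and it assumes Python is available on the machine where you run it. To mirror the "from each host" advice, it would have to be run on every host in the cluster.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def resolve(name: str) -> Optional[str]:
    """Return an IPv4 address for name, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except (OSError, UnicodeError):
        # socket.gaierror and socket.herror are subclasses of OSError.
        return None


def check_resolution(
    names: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Dict[str, Optional[str]]:
    """Map each name to its resolved address (None when it does not resolve)."""
    return {name: resolver(name) for name in names}


def unresolved(results: Dict[str, Optional[str]]) -> List[str]:
    """Return the names that failed to resolve, sorted for stable output."""
    return sorted(name for name, address in results.items() if address is None)


if __name__ == "__main__":
    # Placeholder names: list both the short name and the FQDN of every host.
    cluster_names = ["esx01", "esx01.example.com", "esx02", "esx02.example.com"]
    failed = unresolved(check_resolution(cluster_names))
    print("All names resolve" if not failed else "Cannot resolve: " + ", ".join(failed))
```

The resolver is passed in as a parameter so the check can be tried against a fixed table before pointing it at real DNS.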

"Configuring HA failed" or "while using HA, the VM did not fail over"
Cause: the size of the Fully Qualified Domain Name (FQDN) or short host name.
Workaround:
If the host short name is more than 29 characters, change the
HOSTNAME entry in /etc/sysconfig/network to a shorter name.
If using an FQDN that is greater than 29 characters:

  • Change the FQDN to less than or equal to 29 characters.
  • Remove the existing cluster.
  • Create a new cluster.
  • Add all the hosts back to the cluster.
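Before rebuilding a cluster, it can help to check every name against the 29-character limit. A minimal sketch, where the constant and function names are hypothetical and the host names are placeholders:

```python
from typing import Iterable, List

# Limit quoted in the text above: names longer than this cause HA problems.
MAX_NAME_LENGTH = 29


def name_too_long(name: str, limit: int = MAX_NAME_LENGTH) -> bool:
    """Return True if the name exceeds the limit."""
    return len(name) > limit


def names_over_limit(names: Iterable[str], limit: int = MAX_NAME_LENGTH) -> List[str]:
    """Return the names that exceed the limit, in the order given."""
    return [name for name in names if name_too_long(name, limit)]


if __name__ == "__main__":
    # Placeholder names: check both the short name and the FQDN of each host.
    candidates = ["esx01", "esx01.datacenter-east.corp.example.com"]
    for name in names_over_limit(candidates):
        print(f"{name} is {len(name)} characters (limit {MAX_NAME_LENGTH})")
```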

HA Configuration Fails
Check DNS and the FQDN.
If you have just added a new host to the cluster:

  • Check /opt/LGTOaam512/log/aam_config_util_addnode.log
  • Check /var/log/vmware/vpx/vpxa.log
  • In VC, right-click the host that shows the HA problem and click
    Reconfigure for HA.
  • Were all the hosts responding?
  • If not, the new host cannot communicate with any of the primary hosts.
  • Solution:
    • Disconnect all the hosts that are not responding before you add the new host.
    • The new host becomes the first primary host.
    • When the other hosts become available again, their HA service is
      reconfigured.
  • Most issues are due to DNS issues.
    Limit of 29 characters for the host name and DNS suffix.
    Enter the Fully Qualified Domain Name.
    Each host in an HA cluster must be able to resolve the host name and
    IP address of all other hosts in the cluster.
    • Set up DNS on each host.
    • Recommended: Edit the /etc/hosts file to provide redundancy in
      case DNS lookups fail (documentation discourages it).
    • Proper way: Edit the nsswitch.conf file and change the hosts line
      to read: "hosts: dns files".

HA Troubleshooting - The Usual Suspects
  • Check for DNS
    Make sure you can resolve the short hostname (without domain
    name) of each ESX host from each other ESX host in the cluster.
  • Check length of Fully Qualified Domain Name (FQDN)
    FQDN too long: make sure the fully qualified domain name of all
    hosts is no more than 29 characters.
  • Check entries in /etc/hosts and /etc/resolv.conf
    • Put hostname.FQDN there as well
    • Ping each host, from each host, by hostname, by FQDN.
    • Ping the VC Server from each host
  • Check log files in /opt/LGTOaam512/log
    Files to look out for:
    • aam_config_util_listprimaries.log - Shows the primary hosts
    • aam_config_util_listnodes.log

The Importance of Planning

  • Plan effectively
    Assess workloads and their requirements
    Profile CPU/Memory/IO on existing platforms
    CPU/Memory/IO mixture has to be balanced

http://download3.vmware.com/vmworld/2006/tac9413.pdf


nikkar
