VMware Cloud Community
Aneesh801
Contributor

Host Isolation

Hi All,

I have an issue with an ESXi host in an HA & DRS enabled cluster. The host got isolated from the network, yet all the virtual machines remained on that host even after the isolation. That should not happen if HA is enabled in the cluster, right? I then restarted the management network of the ESXi host; it got reconnected, and all VMs migrated to other hosts except one.

I am using ESXi and vCenter Server version 5.0. There are 14 hosts in the cluster. I have attached screenshots of the HA settings for the cluster.

Can anyone please help me find out why this happened and which log files I can get the details from?

Aneesh

a_p_
Leadership

With the "Host Isolation response" set to "Leave powered on" the VMs will not be restarted on other hosts in case of an isolation. You will need to change this setting to either "Shut down" or "Power off". In addition to the network heartbeat, vSphere 5 also uses a datastore heartbeat to avoid false positives in case only the Management network gets isolated. I assume the migration after restarting the Management Network was a "soft" migration (vMotion/DRS) rather than a restart of the VMs by HA!?
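
If you prefer to script that change instead of using the vSphere Client, something like the pyVmomi sketch below should do it. This is only an illustration, not something from this thread: the vCenter address, credentials and cluster name are placeholders, and it assumes pyVmomi is installed wherever you run it.

================

# Minimal sketch (placeholders only): switch the cluster-wide default HA
# isolation response from "leave powered on" to "shut down".
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Locate the cluster object by name
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "MyCluster")

# "shutdown" asks HA to shut guests down cleanly on an isolated host;
# "powerOff" would hard-stop them instead.
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        defaultVmSettings=vim.cluster.DasVmSettings(isolationResponse="shutdown")))
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

Disconnect(si)

================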

For details about HA see http://www.yellow-bricks.com/vmware-high-availability-deepdiv/

André

Aneesh801
Contributor

Hi Andre,

Thank you very much for your reply. How can I find out the actual reason for the host isolation? Which logs should I check?

Aneesh

depping
Leadership

You are experiencing the "default" behavior of HA. Only if you change the "isolation response" will the VMs be restarted. Now keep in mind that this could mean that in the case of a false positive (for instance, the network ports for this server have a short 30-second dip) your VMs will also be restarted.

You can check the following log files to see why this happened:

/var/log/vmkernel.log

/var/log/hostd.log

/var/log/fdm.log

I would start with the FDM log file to figure out the time and see what happened and then use the vmkernel log file to determine the cause. But more than likely this is HW / Network related.
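
If scanning those files by hand gets tedious, a rough sketch like this can pull the isolation/election related lines out of fdm.log along with their timestamps. It assumes the default /var/log/fdm.log location; adjust the path if your logs are redirected to a syslog server or to /scratch/log/.

================

# Rough sketch: print HA isolation / election state changes from fdm.log
import re

PATTERNS = ("isolat", "Slave timed out", "Election error", "ChangeState")

with open("/var/log/fdm.log") as fdm:
    for line in fdm:
        if any(p in line for p in PATTERNS):
            # FDM lines start with an ISO 8601 timestamp, e.g. 2012-03-02T02:25:32.102Z
            stamp = re.match(r"\d{4}-\d{2}-\d{2}T\S+Z", line)
            print(stamp.group(0) if stamp else "?", line.rstrip())

================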

gvenkatsumanth
Contributor

Hi Aneesh,

You can change your cluster HA setting as follows:

   Host Isolation response - Shut down VM.

This will migrate the VMs if any host is down.

Regards,

Venkat.

depping
Leadership

I would rather refer to it as a restart instead of a migration. Migrate = vMotion... this is not seamless: the VM is powered off and then powered on.

Aneesh801
Contributor

Hi,

I am trying to find the cause from the logs. I checked the FDM logs, but it seems they have been truncated. Is there any other way to find the reason?

Aneesh

depping
Leadership

Hopefully you have a syslog server set up or redirected your scratch? Try the following dir: /scratch/log/

gvenkatsumanth
Contributor

But I think HA uses cold migration, right?

Aneesh801
Contributor

Hi,

I have checked the logs and was able to find a few details. The log details are pasted below.

FDM Log

================

2012-03-02T02:21:48.719Z [69C25B90 verbose 'Cluster'] ICMP reply for non-existent pinger 8 (id=healthMonHostPinger)
2012-03-02T02:22:16.694Z [FFF79400 verbose 'Cluster'] ICMP reply for non-existent pinger 4 (id=isolationPinger)
2012-03-02T02:24:02.379Z [69CA7B90 info 'Cluster' opID=SWI-67934037] [ClusterManagerImpl::LogState] hostId=host-1421 state=Slave master=host-1120 isolated=false host-li
2012-03-02T02:25:32.102Z [69BE4B90 info 'Election' opID=SWI-a736678e] Slave timed out
2012-03-02T02:25:32.102Z [69BE4B90 info 'Election' opID=SWI-a736678e] [ClusterElection::ChangeState] Slave => Startup : Lost master
2012-03-02T02:25:32.102Z [69BE4B90 info 'Cluster' opID=SWI-a736678e] Change state to Startup:0
2012-03-02T02:25:32.102Z [69CA7B90 verbose 'Cluster' opID=SWI-67934037] [ClusterManagerImpl::CheckElectionState] Transitioned from Slave to Startup
2012-03-02T02:25:32.102Z [69CA7B90 info 'Invt' opID=SWI-67934037] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/4ed3f3ef

2012-03-02T02:25:47.193Z [69C66B90 verbose 'Cluster' opID=SWI-fc3b3619] [CleanupDir] Opening /vmfs/volumes/4f2d4947-c05fe85a-fb25-001018ad9a60/.vSphere-HA/FDM-7CB4C113-
2012-03-02T02:25:48.102Z [69BE4B90 verbose 'Election' opID=SWI-a736678e] [ClusterElection::MasterStateFunc] Am isolated! Dropping to STARTUP!
2012-03-02T02:25:48.102Z [69BE4B90 warning 'Election' opID=SWI-a736678e] Election error
2012-03-02T02:25:48.102Z [69BE4B90 info 'Election' opID=SWI-a736678e] [ClusterElection::ChangeState] Master => Startup : Election error
2012-03-02T02:25:48.102Z [69BE4B90 info 'Cluster' opID=SWI-a736678e] Change state to Startup:0
2012-03-02T02:25:48.102Z [69CA7B90 verbose 'Cluster' opID=SWI-67934037] [ClusterManagerImpl::CheckElectionState] Transitioned from Master to Startup
2012-03-02T02:25:48.102Z [69CA7B90 info 'Invt' opID=SWI-67934037] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volu

================

Vmkernel Log

2012-03-02T02:18:56.651Z cpu15:4111)ScsiDeviceIO: 2316: Cmd(0x412441a9ab40) 0x12, CmdSN 0x7ab82 to dev "naa.600508e000000000ffc76dddff40ae0d" failed H:0x0 D:0x2 P:0x0 V
2012-03-02T02:23:56.669Z cpu20:4116)ScsiDeviceIO: 2316: Cmd(0x4124413188c0) 0x12, CmdSN 0x7abde to dev "naa.600508e000000000ffc76dddff40ae0d" failed H:0x0 D:0x2 P:0x0 V
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_stats_update:5583(vmnic0)]storm stats were not updated for 3 times
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_stats_update:5584(vmnic0)]driver assert
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_panic_dump:933(vmnic0)]begin crash dump -----------------
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_panic_dump:940(vmnic0)]def_idx(0xc4b2)  def_att_idx(0x4a)  attn_state(0x0)  spq_prod_idx(0x7d) next_stats_cnt(0xb903)
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_panic_dump:945(vmnic0)]DSB: attn bits(0x0)  ack(0x100)  id(0x0)  idx(0x4a)
<3>[bnx2x_panic_dump:946(vmnic0)]     def (0x0 0x0 0x0 0x0 0x0 0x0 0x0 0xdb8d 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0)  2012-03-02T02:25:24.732Z cpu3:4099)igu_sb_id(0x0)  igu_s
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_panic_dump:990(vmnic0)]fp0: rx_bd_prod(0xc413)  rx_bd_cons(0x14)  rx_comp_prod(0x8575)  rx_comp_cons(0x8172)  *rx_cons_sb(0
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_panic_dump:994(vmnic0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x8ede)
2012-03-02T02:25:24.732Z cpu3:4099)<3>[bnx2x_panic_dump:1001(vmnic0)]fp0: tx_pkt_prod(0x495e)  tx_pkt_cons(0x495e)  tx_bd_prod(0x8e59)  tx_bd_cons(0x8e58)  *tx_cons_sb(
<3>[bnx2x_panic_dump:1012(vmnic0)]     run indexes (0x8ede 0x0)<3>[bnx2x_panic_dump:1018(vmnic0)]     indexes (0x0 0x8172 0x0 0x0 0x0 0x495e 0x0 0x0)2012-03-02T02:25:24
2012-03-02T02:25:24.732Z cpu3:4099)SM[0] __flags (0x0) igu_sb_id (0x2)  igu_seg_id(0x0) time_to_expire (0x600a9719) timer_value(0xff)
2012-03-02T02:25:24.732Z cpu3:4099)SM[1] __flags (0x0) igu_sb_id (0x2)  igu_seg_id(0x0) time_to_expire (0xffffffff) timer_value(0xff)
2012-03-02T02:25:24.732Z cpu3:4099)INDEX[0] flags (0x0) timeout (0x0)

================

Hostd Log

================

2012-03-02T02:23:42.278Z [69C81B90 verbose 'ha-license-manager' opID=HB-host-1421@3103-e5a5aeae-78] Load: Loading existing file: /etc/vmware/license.cfg
2012-03-02T02:23:42.289Z [69C81B90 verbose 'Default' opID=HB-host-1421@3103-e5a5aeae-78] ha-license-manager:Validate -> Valid license detected for "VMware ESX Server 5.
2012-03-02T02:23:50.946Z [6A638B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
2012-03-02T02:23:51.900Z [6A040B90 verbose 'SoapAdapter'] Responded to service state request
2012-03-02T02:23:53.918Z [6A5A4B90 verbose 'SoapAdapter'] Responded to service state request
2012-03-02T02:24:00.038Z [6A638B90 warning 'Statssvc'] Calculated read I/O size 1041981 for scsi2:0 is out of range -- 1041981,prevBytes = 131918865920 curBytes = 13208
2012-03-02T02:24:10.976Z [6A679B90 verbose 'Default'] Power policy is unset
2012-03-02T02:24:10.976Z [6A5E5B90 verbose 'Default'] Power policy is unset
2012-03-02T02:24:12.243Z [6A39CB90 verbose 'Proxysvc Req08311'] New proxy client SSL(TCP(local=10.221.11.11:443, peer=10.221.11.201:63621))
2012-03-02T02:24:12.267Z [69CC2B90 verbose 'Locale' opID=HB-host-1421@3104-e045914c-ed] Default resource used for 'counter.virtualDisk.commandsAborted.label' expected i
2012-03-02T02:24:12.267Z [69CC2B90 verbose 'Locale' opID=HB-host-1421@3104-e045914c-ed] Default resource used for 'counter.virtualDisk.commandsAborted.summary' expected
2012-03-02T02:24:12.267Z [69CC2B90 verbose 'Locale' opID=HB-host-1421@3104-e045914c-ed] Default resource used for 'counter.virtualDisk.busResets.label' expected in modu
2012-03-02T02:24:12.267Z [69CC2B90 verbose 'Locale' opID=HB-host-1421@3104-e045914c-ed] Default resource used for 'counter.virtualDisk.busResets.summary' expected in mo

================

But I am not able to find the actual reason for the isolation. Please have a look at these logs and let me know if you spot any clues.

Aneesh

depping
Leadership

I am not a support guy but this indicates to me that there was a problem with your NICs considering the "bnx2x_panic_dump" in the messages. Are you running the latest firmware / NIC drivers and vSphere builds?
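
One way to confirm that ordering is to merge the two logs into a single timeline. The sketch below is just a quick illustration; the file paths and match strings are assumptions based on the excerpts above.

================

# Quick-and-dirty sketch: merge NIC-related vmkernel.log entries and FDM
# election entries, sorted by timestamp, to see which event came first.
import re

TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z")

def pick(path, needles):
    events = []
    with open(path) as log:
        for line in log:
            if any(n in line for n in needles):
                stamp = TS.search(line)
                if stamp:
                    events.append((stamp.group(0), line.rstrip()))
    return events

timeline = pick("/var/log/vmkernel.log", ("bnx2x", "vmnic0")) + \
           pick("/var/log/fdm.log", ("isolat", "Lost master", "Election error"))

for stamp, entry in sorted(timeline):
    print(stamp, entry)

================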

Aneesh801
Contributor

Hi,

Thanks for your reply.

I searched around for a clue and came across some KB articles.

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&externalId=1029368

But this article says that the issue was already resolved in earlier versions of the bnx2x network driver. The version I am using is 1.61.15.v50.1, Build 469512, on ESXi 5.0, but the issue is still there. :(

Aneesh

depping
Leadership

Can I suggest filing a support ticket? They will go through the log files and try to figure out what happened, why, and how to mitigate it.

Aneesh801
Contributor

Hi Duncan,

I am trying to file a support ticket or call them. I will let you know their reply. Thanks for your help.

Aneesh

depping
Leadership

Okay, let us know what the result is or feel free to post the SR number so we can keep track ourselves.

Aneesh801
Contributor

Hi All,

I have updated the Broadcom bnx2x driver to version 1.70.34.50.1 and monitored the servers for some days. So far there have been no isolation-related issues. It seems the issue has been fixed by the driver update.

Aneesh

Aneesh801
Contributor

Upgrading the Broadcom driver fixed the issue.

depping
Leadership

Thanks for letting us know what fixed the problem...
