VMware Cloud Community
vervoortjurgen

ESX host unresponsive

Hi,

I have a problem at a customer site with 3 IBM HS23 blades.

Four times in the past 7 months, all ESXi hosts became offline/unresponsive.

All VMs keep running, so production keeps running.

The ESXi hosts don't respond to SSH or telnet connections.

I can log in after pressing F2 and F12, but then the host just hangs.

I can't log in to the ESXi hosts with the VI client either.

All hosts are installed on USB disks from IBM.

The only way to resolve it is to power off the blades and start them again.

It is also random: between incidents the hosts have run fine for 3 months, for 1 day, and for 3 weeks.

The vpxd.log file only shows timeouts communicating with the ESXi hosts, and then the hosts are reported as offline.

I have already updated all firmware (chassis, blades, storage, Brocade fibre and IBM switches).

I started with vSphere 5.5 U1 and ESXi 5.5 U1 from IBM, and have since updated to vSphere 5.5 U2 and ESXi 5.5 U2 from VMware (since IBM doesn't have a custom ESXi 5.5 U2).

I'm a little stuck. The last incident was on 16/01/2015, and IBM won't give support since there is only a 3-year subscription and no software support contract.

One last point: I'm running Veeam Backup & Replication 7.x, but I don't see a problem there.

Thanks in advance.

kind regards Vervoort Jurgen VCP6-DCV, VCP-cloud http://www.vdssystems.be

jrmunday

Hi vervoortjurgen,

This reminds me of a similar issue I faced in the past on vSphere 5.0. Here are the details from 2 years ago:

Exhausting inodes + Disconnected Host

Re: Free INODES and % free RAMDISK

I eventually got a hotfix from VMware to address this, but in the interim I tweaked the script to monitor inodes and ramdisk and email me every time thresholds were reached (so I could react before there was an outage). I can dig this out if you need it.

I also remember the alerts being generated after a support bundle was created, which filled up the TMP volume.

It would be interesting to know whether you have the same issue.
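
For anyone wanting to try the same kind of monitoring, here is a minimal Python sketch of the threshold check only. It is not the script mentioned above: the `RamdiskUsage` fields, the 80% limit and the sample figures are all assumptions made for illustration, and collecting the real numbers from the host is left out.

```python
from dataclasses import dataclass


@dataclass
class RamdiskUsage:
    """Usage figures for one ramdisk (field names are illustrative)."""
    name: str
    used_kib: int
    max_kib: int
    used_inodes: int
    max_inodes: int


def over_threshold(disks, pct_limit=80.0):
    """Return (name, metric, percent) for every figure at or above pct_limit."""
    alerts = []
    for d in disks:
        for metric, used, maximum in (
            ("space", d.used_kib, d.max_kib),
            ("inodes", d.used_inodes, d.max_inodes),
        ):
            if maximum <= 0:
                continue  # skip unknown limits instead of dividing by zero
            pct = 100.0 * used / maximum
            if pct >= pct_limit:
                alerts.append((d.name, metric, round(pct, 1)))
    return alerts


# Hypothetical sample figures, not taken from the hosts in this thread.
sample = [
    RamdiskUsage("root", 6000, 32768, 7900, 8192),
    RamdiskUsage("tmp", 190000, 196608, 40, 8192),
    RamdiskUsage("etc", 300, 28672, 500, 4096),
]

for name, metric, pct in over_threshold(sample):
    print(f"WARNING: ramdisk {name} {metric} at {pct}%")
```

Checking space and inodes separately matters here, because a ramdisk can run out of inodes while it still has plenty of free space.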

Cheers,

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
admin

Is the microcode up-to-date?

prasannag6

We could probably start with the hostd and vmkernel logs. Could you please upload them?

An APD issue caused by storage device loss can also cause this behaviour. To confirm, check whether the vmkernel logs contain messages such as "PERM LOSS" or "failed with status Device is permanently unavailable".
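
A quick way to look for those messages in a copy of the logs is a small Python scan like the sketch below. The two search strings are simply the ones quoted in this post, so the exact vmkernel wording on a given build may differ; the sample lines are made up for illustration.

```python
import re

# Substrings quoted in the post above; real vmkernel wording may differ by build.
PATTERNS = [
    re.compile(r"PERM LOSS", re.IGNORECASE),
    re.compile(r"Device is permanently unavailable", re.IGNORECASE),
]


def find_storage_loss(lines):
    """Yield (line_number, text) for every line that matches a marker."""
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in PATTERNS):
            yield number, line.rstrip("\n")


def scan_file(path):
    """Scan a log file on disk, tolerating odd bytes in the log."""
    with open(path, errors="replace") as handle:
        return list(find_storage_loss(handle))


# Hypothetical sample lines, not taken from the attached logs.
sample = [
    "some ordinary log line\n",
    "command failed with status Device is permanently unavailable\n",
    "another ordinary log line\n",
]

for number, text in find_storage_loss(sample):
    print(f"line {number}: {text}")
```

`scan_file` can be pointed at a copied vmkernel.log; an empty result only means these two strings were not found, not that the storage is healthy.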

----------If you found this or any other answer helpful, please consider to award points (use Correct or Helpful buttons). Regards, Prasanna----------
FritzBrause

Yes. Sounds like a hostd hang.

Check /var/log/syslog.log, vobd.log, vmkwarning.log and hostd.log before the hang occurred.

Could be a memory leak, full ramdisk, no free inodes (like the ones mentioned above), or PDL/APD (storage not available).

Perhaps stop any 3rd party SW on the host (like HW monitoring etc.).

vervoortjurgen

Yes, all microcode is up to date.

IBM confirms there are no errors on the hardware.

I'm attaching the log files.

The last error occurred on 27-01-2015 at 20:20. I had to reboot the hosts because the customer needs to be able to manage them.

Now I'm thinking it could be the CPU load.

In production I have 70% CPU load.

The CPU is also weak, I think: an E5-2609 at 2.4 GHz.

The monitoring software is stopped; only the Veeam backups run, from 18:00 until 24:00.

McAfee MOVE is also active.

I don't see the errors you all mention.

Any suggestions?

kind regards Vervoort Jurgen VCP6-DCV, VCP-cloud http://www.vdssystems.be
Titanomachia

Are you able to connect via the console, run esxtop and view the network stats for dropped packets? Are you also able to check the config on the management ports? I had this with a faulty NIC that negotiated down from 1Gb to 100Mb.

vervoortjurgen

Hello,

An update:

I now have 2 vSphere environments with this problem,

so I compared the logs.

I think it's the iSCSI datastore that makes my hosts unresponsive.

I've been reading on the internet, and a lot of people seem to have had problems since Update 2?

A lot of iSCSI storage devices aren't supported anymore?

Anyway, I have a case open with Veeam, because I use my iSCSI storage for replicas most of the time.

Hopefully they can confirm my findings.

kind regards Vervoort Jurgen VCP6-DCV, VCP-cloud http://www.vdssystems.be
FritzBrause

You mentioned a reboot at 27-01-2015 at 20:20.

But vmksummary.log does not show any reboot on the 27th Jan.

2015-01-29T19:03:36Z bootstop: Host has booted

2015-01-29T21:56:21Z bootstop: Host is rebooting

2015-01-29T22:02:46Z bootstop: Host has booted

Around this time, no errors in the logs. Some logs are cycled already and the older ones are in /var/run/log.
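
If you want to check the boot history yourself, the bootstop lines are easy to parse. The Python sketch below uses only the three lines quoted above; the helper names are my own. Note that the timestamps end in Z (UTC), so a reboot noted in local time may appear shifted by the local offset.

```python
from datetime import date, datetime

# The three vmksummary.log lines quoted above.
LOG = """\
2015-01-29T19:03:36Z bootstop: Host has booted
2015-01-29T21:56:21Z bootstop: Host is rebooting
2015-01-29T22:02:46Z bootstop: Host has booted
"""


def boot_events(text):
    """Return (timestamp, message) pairs for every bootstop line."""
    events = []
    for line in text.splitlines():
        stamp, sep, message = line.partition(" bootstop: ")
        if not sep:
            continue  # not a bootstop line
        events.append((datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ"), message))
    return events


def events_on(events, day):
    """Keep only the events whose UTC date equals day."""
    return [event for event in events if event[0].date() == day]


events = boot_events(LOG)
print(len(events_on(events, date(2015, 1, 27))))  # prints 0
print(len(events_on(events, date(2015, 1, 29))))  # prints 3
```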

The only notable thing is this in vmkwarning.log:

2015-01-29T13:36:00.706Z cpu1:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac7 (refITT=0x68fcd) timed out.

2015-01-29T13:36:12.709Z cpu2:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac8 (refITT=0x68fcd) timed out.

2015-01-29T13:36:24.712Z cpu3:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac9 (refITT=0x68fcd) timed out.

This repeats every 12 seconds: the vmkernel tries to abort a SCSI task and the abort itself times out.
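
To see the pattern without reading every line, something like the Python sketch below groups the warnings by path and refITT and prints the gaps between them. The regular expression is written against the three lines quoted above only, so treat it as an assumption that other builds use the same layout.

```python
import re
from collections import defaultdict
from datetime import datetime

# The three vmkwarning.log lines quoted above.
LOG = """\
2015-01-29T13:36:00.706Z cpu1:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac7 (refITT=0x68fcd) timed out.
2015-01-29T13:36:12.709Z cpu2:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac8 (refITT=0x68fcd) timed out.
2015-01-29T13:36:24.712Z cpu3:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac9 (refITT=0x68fcd) timed out.
"""

LINE_RE = re.compile(
    r'^(?P<ts>\S+Z) cpu\d+:\d+\)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: '
    r'(?P<path>vmhba\d+:CH:\d+ T:\d+ L:\d+) : Task mgmt "Abort Task" '
    r'with itt=(?P<itt>0x[0-9a-f]+) \(refITT=(?P<ref>0x[0-9a-f]+)\) timed out\.'
)


def abort_timeouts(text):
    """Group abort-timeout timestamps by (path, refITT)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
            groups[(match["path"], match["ref"])].append(stamp)
    return groups


def gaps_seconds(stamps):
    """Seconds between consecutive timestamps, rounded to milliseconds."""
    return [round((b - a).total_seconds(), 3) for a, b in zip(stamps, stamps[1:])]


for (path, ref), stamps in abort_timeouts(LOG).items():
    print(path, ref, len(stamps), gaps_seconds(stamps))
```

All three warnings refer to the same refITT on the same path, which means the aborts keep targeting one stuck command rather than several different ones.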

Yes, you should investigate the storage and the iSCSI setup to check that everything is correct there.

Check if those lines come up every time the servers got stuck.

Besides that there is nothing else in the logs pointing to any problem.

Check /var/run/log since some log files in /var/log are already cycled.

vervoortjurgen

I found the problem.

I restarted only the iSCSI datastore, and all the ESXi hosts became responsive again.

After searching the QNAP forum, I noticed that they released an update on 29/01/2015.

I'm testing the firmware now to see what happens.

If this fails, I'm guessing the QNAP TS-469L isn't supported anymore for vSphere 5.5.

Thanks all for the suggestions.

kind regards Vervoort Jurgen VCP6-DCV, VCP-cloud http://www.vdssystems.be