VMware Cloud Community
Geo55
Contributor

I don't understand why my ESXi servers keep crashing

History:

I've had a cluster of three identical ESXi servers (version 5.5.0 u1 / DELL PowerEdge R610) with their vCenter for more than a year.

The cluster ran correctly until January.

The first incidents affected Linux VMs: the file system failed and was remounted read-only. Rebooting the VM fixed it.

The following incidents were crashes of an ESXi host (SSH on the host still responded, but when a command was typed there was no response, as if it were blocked):

    -> VMs on that server were no longer visible or accessible

    -> HA (host failure, VM and application monitoring) did not detect the problem and did not restart the affected VMs.

    -> However, once I had to stop and restart the server, HA was triggered.

Some time after that incident, another ESXi host in the cluster crashed with the same problem and the same fix.

Later still, an ESXi host in the cluster crashed again...

The only relevant entries I saw in the logs ("/var/log/vmkwarning.log") were:

    2016-03-05T14:20:01.487Z cpu11:34017119)ALERT: hostd detected to be non-responsive

    ...

    2016-03-06T16:00:01.403Z cpu1:34258627)ALERT: hostd detected to be non-responsive

    2016-03-06T16:35:01.761Z cpu12:34264098)ALERT: hostd detected to be non-responsive

So I decided to completely rebuild a clean cluster with the same servers and the latest version of ESXi and vCenter, "5.5.0 u3".

The datastores were reconnected and the VMs were brought back up.

Everything went well for a week...

Then I had the unpleasant surprise of another ESXi server crash,

with the same log entries before the crash ("/var/log/vmkwarning.log"):

    2016-03-15T15:10:02.206Z cpu12:1969007)ALERT: hostd detected to be non-responsive

    2016-03-15T15:49:02.025Z cpu11:1974883)ALERT: hostd detected to be non-responsive
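To line these alerts up with other events (backups, storage warnings, and so on), it can help to pull out just their timestamps. Below is a minimal sketch in Python, assuming the log lines look exactly like the two quoted above; the function name `hostd_alert_times` is made up for this example and is not part of any VMware tool.

```python
import re
from datetime import datetime

# Pattern based only on the two alert lines quoted in this post (assumption).
ALERT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s+"
    r"cpu\d+:\d+\)ALERT: hostd detected to be non-responsive"
)


def hostd_alert_times(lines):
    """Return the timestamps of every 'hostd ... non-responsive' alert."""
    times = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f"))
    return times


# The two lines from the excerpt above; a real run would read vmkwarning.log.
sample = [
    "2016-03-15T15:10:02.206Z cpu12:1969007)ALERT: hostd detected to be non-responsive",
    "2016-03-15T15:49:02.025Z cpu11:1974883)ALERT: hostd detected to be non-responsive",
]

times = hostd_alert_times(sample)
# Seconds elapsed between consecutive alerts.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
print(times, gaps)
```

With the real file, the same function could be fed `open("vmkwarning.log")` after copying the log off the host.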

---

I don't know what to do,

and I don't know where to look to find the cause.

Is it possible for a VM to make an ESXi host crash?

Thank you for your ideas and your support,

Bye.

GEO

3 Replies
DavoudTeimouri
Virtuoso

Please share more details, such as the type of storage devices and the ESXi log files.

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
Geo55
Contributor

Hello ^^

Thank you for your response.

More details about the storage:

The ESXi hosts and VMs are connected to a DELL PS4001 over iSCSI (without encryption).

The log files are difficult to share in full.

Here is what appears just around "ALERT: hostd detected to be non-responsive":

2016-03-15T13:16:20.036Z cpu2:465325)WARNING: CBT: 2060: Unsupported ioctl 60

2016-03-15T13:16:20.036Z cpu2:465325)WARNING: CBT: 2060: Unsupported ioctl 59

2016-03-15T13:16:20.036Z cpu2:465325)WARNING: CBT: 2060: Unsupported ioctl 43

2016-03-15T13:45:10.541Z cpu6:32803)WARNING: ScsiDeviceIO: 1223: Device naa.603be8bf6ea1fec5e553f50100002004 performance has deteriorated. I/O latency increased from average value of 43368 microseconds to 909475 microseconds.

2016-03-15T13:45:16.733Z cpu5:32802)WARNING: ScsiDeviceIO: 1223: Device naa.603be8bf6ea1fec5e553f50100002004 performance has deteriorated. I/O latency increased from average value of 43369 microseconds to 878421 microseconds.

2016-03-15T14:22:08.356Z cpu1:33024)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.603be8bf6ea1fec5e553f50100002004" state in doubt; requested fast path state update...

2016-03-15T14:22:19.388Z cpu13:32810)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.603be8bf6ea1fec5e553f50100002004" state in doubt; requested fast path state update...

2016-03-15T14:22:19.413Z cpu3:46602)WARNING: ScsiDeviceIO: 1223: Device naa.603be8bf6ea1fec5e553f50100002004 performance has deteriorated. I/O latency increased from average value of 43318 microseconds to 1121555 microseconds.

2016-03-15T14:22:19.414Z cpu3:46602)WARNING: ScsiDeviceIO: 1223: Device naa.603be8bf6ea1fec5e553f50100002004 performance has deteriorated. I/O latency increased from average value of 43318 microseconds to 2499496 microseconds.

2016-03-15T14:22:35.508Z cpu13:32810)WARNING: ScsiDeviceIO: 1223: Device naa.603be8bf6ea1fec5e553f50100002004 performance has deteriorated. I/O latency increased from average value of 43320 microseconds to 5924671 microseconds.

2016-03-15T14:48:26.356Z cpu14:32941)WARNING: VSCSI: 2850: handle 8468(vscsi0:1):Retry 0 overdue by 2 seconds

2016-03-15T14:48:27.276Z cpu12:464394)WARNING: VSCSI: 3573: handle 8467(vscsi0:0):WaitForCIF: Issuing reset;  number of CIF:1

2016-03-15T14:48:27.276Z cpu12:464394)WARNING: VSCSI: 2495: handle 8467(vscsi0:0):Ignoring double reset

2016-03-15T15:10:02.206Z cpu12:1969007)ALERT: hostd detected to be non-responsive

2016-03-15T15:49:02.025Z cpu11:1974883)ALERT: hostd detected to be non-responsive

TSC: 779379 cpu0:1)WARNING: ACPI: 1386: SPCR: Detected unsupported reg bit width (0); will assume 8 bits.

TSC: 785046 cpu0:1)WARNING: ACPI: 1435: SPCR: Detected invalid baud rate (0); will assume 115200

0:00:00:00.000 cpu0:1)WARNING: Serial: 646: Invalid serial port config: mem-mapped to addr 0x0.

2016-03-15T16:19:49.870Z cpu7:33382)WARNING: LinuxSignal: 538: ignored unexpected signal flags 0x2 (sig 17)

2016-03-15T16:19:51.139Z cpu10:33262)WARNING: Team.etherswitch: TeamES_Activate:692: Failed to initialize beaconing on portset 'pps': Not implemented.

2016-03-15T16:19:54.773Z cpu4:33421)WARNING: LinScsiLLD: scsi_add_host:573: vmkAdapter (usb-storage) sgMaxEntries rounded to 255. Reported size was 65535

2016-03-15T16:19:54.774Z cpu4:33421)WARNING: LinScsiLLD: scsi_add_host:573: vmkAdapter (usb-storage) sgMaxEntries rounded to 255. Reported size was 65535

2016-03-15T16:20:07.976Z cpu10:33433)WARNING: ScsiScan: 1408: Failed to add path vmhba0:C0:T0:L0 : Not found

2016-03-15T16:20:07.979Z cpu10:33433)WARNING: ScsiScan: 1408: Failed to add path vmhba0:C0:T1:L0 : Not found

2016-03-15T16:20:12.377Z cpu12:33262)WARNING: NetDVS: 567: portAlias is NULL

2016-03-15T16:20:12.514Z cpu12:33262)WARNING: Tcpip_Vmk: 808: Failed to set default gateway (51): Network unreachable

2016-03-15T16:20:12.517Z cpu12:33262)WARNING: NetDVS: 567: portAlias is NULL

2016-03-15T16:20:12.520Z cpu12:33262)WARNING: Tcpip_Vmk: 808: Failed to set default gateway (51): Network unreachable

2016-03-15T16:20:12.523Z cpu12:33262)WARNING: NetDVS: 567: portAlias is NULL

2016-03-15T16:20:12.526Z cpu12:33262)WARNING: Tcpip_Vmk: 808: Failed to set default gateway (51): Network unreachable

2016-03-15T16:20:12.529Z cpu12:33262)WARNING: NetDVS: 567: portAlias is NULL
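The "performance has deteriorated" warnings are easier to read once converted to milliseconds. Here is a minimal sketch in Python that extracts them, assuming the message wording matches the lines above exactly; the function name `latency_spikes` is hypothetical and chosen only for this example.

```python
import re

# Pattern based on the ScsiDeviceIO warnings quoted above (assumption:
# the wording is identical on every such line).
LATENCY_RE = re.compile(
    r"Device (?P<dev>naa\.[0-9a-f]+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def latency_spikes(lines):
    """Return (device, average_ms, current_ms, ratio) for each latency warning."""
    spikes = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            avg_us = int(match.group("avg"))
            cur_us = int(match.group("cur"))
            spikes.append(
                (match.group("dev"), avg_us / 1000.0, cur_us / 1000.0, cur_us / avg_us)
            )
    return spikes


# One warning copied from the log above; a real run would read the whole file.
sample = [
    "2016-03-15T14:22:35.508Z cpu13:32810)WARNING: ScsiDeviceIO: 1223: "
    "Device naa.603be8bf6ea1fec5e553f50100002004 performance has deteriorated. "
    "I/O latency increased from average value of 43320 microseconds "
    "to 5924671 microseconds.",
]

spikes = latency_spikes(sample)
for dev, avg_ms, cur_ms, ratio in spikes:
    print(f"{dev}: average {avg_ms:.1f} ms, now {cur_ms:.1f} ms ({ratio:.0f}x)")
```

For the line used here, the average is about 43 ms and the reported latency is about 5.9 seconds, on the same device that NMP flags as "state in doubt" a few seconds earlier.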

What other file would help?

Thank you very much.

Geo55
Contributor

Hello,

I updated to 5.5.0 u3 but had the same problem :(

Just to say that I resolved my problem by downgrading the ESXi servers from version 5.5.0 u3 to 5.5.0.

Could it be a hardware problem?

Nevertheless, thank you.
