VMware Cloud Community
andvm
Hot Shot

Power-on reset Events

Hi,

I have recently been noticing frequent events on one of the hosts in a vSAN cluster.

[vob.scsi.scsipath.por] Power-on Reset occurred on naa.xxxxxxxxxxxxxx

These events refer to several different disks, not just one, and appear to occur every few minutes.
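To see how the resets are spread across disks, the events can be tallied per device. This is only a minimal sketch: it assumes each event line contains "Power-on Reset occurred on <device>" as in the message above, and `count_por_events` is a hypothetical helper name, not a VMware tool. The log path in the usage comment is also an assumption.

```shell
# Count "Power-on Reset" events per device in a log file.
# Assumption: every event line contains
# "Power-on Reset occurred on <device>", as in the message quoted above.
count_por_events() {
    logfile="$1"
    awk '/Power-on Reset occurred on/ {
             # The device name is the token right after "occurred on".
             for (i = 1; i < NF; i++)
                 if ($i == "occurred" && $(i + 1) == "on") count[$(i + 2)]++
         }
         END { for (d in count) print count[d], d }' "$logfile" | sort -rn
}

# Hypothetical usage on the affected host (log path is an assumption):
# count_por_events /var/log/vobd.log
```

If one device, or the devices of one disk group, dominate the counts, that may help narrow down where to look first.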

I SSH'd to the affected host and ran esxcli vsan storage list; the disks in the events match the vSAN disks.

Any other commands I should issue to check further or any recommended actions?

vSAN Health and the server hardware status are all green.

Other servers in the cluster do not have such events.

Thanks

2 Replies
TheBobkin
Champion

Hello andvm

As you mention that you are seeing multiple disks getting reset, there could be an issue on the controller, on the cache-tier device of one affected Disk-Group or on a single Capacity-tier device (if the DG is deduped).

Can you attach/PM the vmkernel.log and the output of vdq -Hi so we can see whether we can narrow this down?

Other things to look for:

- aberrant latency on some devices (e.g. grep -i increased /var/log/vmkernel.log OR esxtop 'u')

- Indications of the controller driver/firmware having issues (dmesg OR less /var/log/vmkernel.log)

- Validate that the controller and its driver+firmware are on the HCL - please note that a Green Health check is not sufficient on its own, as the check skips devices that are not on the HCL at all (it just assumes you are using them for non-vSAN purposes) - so validate that it has actually checked the controllers in question.
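To check whether the latency warnings line up in time with the resets, the grep above can be bucketed by hour. This is a rough sketch: it assumes each log line starts with an ISO 8601 timestamp (for example 2018-01-01T12:34:56.789Z), so the first 13 characters give date plus hour. `events_per_hour` is a hypothetical helper name.

```shell
# Count log lines matching a pattern (case-insensitive), grouped by hour.
# Assumption: each line starts with an ISO 8601 timestamp, so the first
# 13 characters (e.g. 2018-01-01T12) identify the date and hour.
events_per_hour() {
    pattern="$1"
    logfile="$2"
    grep -i -- "$pattern" "$logfile" |
        awk '{ count[substr($0, 1, 13)]++ }
             END { for (h in count) print h, count[h] }' | sort
}

# Hypothetical usage on the affected host:
# events_per_hour increased /var/log/vmkernel.log
```

Running it a second time with "Power-on Reset" as the pattern, against whichever log holds those events, gives a second set of hourly counts to compare side by side.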

Bob

andvm
Hot Shot

Hi TheBobkin

My findings so far...

The HBA firmware/driver matches the vSAN VCG.

I ran the commands, and the disks with power-on reset events in the logs match the disks listed by vdq -Hi.

I do see a few events relating to increased latency around the time the issue started (however, I do not see particularly high latency spikes in the VM/Backend graphs).

In vmkernel.log I see, among others, events related to lsi_mr3 megasas_hotplug_work etc. I will need to check further, as I see a VMware Knowledge Base article that might be related, although these servers are running ESXi 6.5U2.

I am thinking of placing the host in maintenance mode (MM) and rebooting it, then leaving it running for a few days to see whether similar events recur; if they do, I will raise it with GSS.

Thanks
