We recently upgraded our vCenter and our ESXi hosts to 6.7 U3.
We noticed that the SEAT DB was filling up quite fast.
Last week's stats:
Somehow we are getting a lot more events compared to pre-upgrade (6.5).
To avoid reaching 100% usage, I purged some of the biggest event tables in the VCSA database, changed all statistics levels to level 1, and changed retention to 30 days (the default settings).
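For anyone else wanting to check their own SEAT growth before purging, this is roughly the query I used against the embedded vPostgres database (a sketch; the psql path and the VCDB database name are what they are on my 6.7 VCSA and may differ on yours):

```sql
-- Run from the VCSA shell, e.g.:
--   /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
-- Lists the ten largest tables by total size (indexes included);
-- in my case the event tables were at the top of the list.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```

That makes it easy to see whether it's really the event tables ballooning or something else in SEAT.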
I investigated and found that our vCenter is logging plenty of "Host hardware sensor state" events:
These events are new and we didn't have them pre-upgrade.
I found other people having the same issue:
I created a ticket with VMware support. I'm waiting for an agent to contact me.
Meanwhile, does anyone else experience this?
Do you have a trick to stop these specific "Host hardware sensor state" events from being generated?
VMware support told me it's a known and recent issue and engineers are working on it...
I have another, smaller, environment with just 3 hosts and 1 vCenter, all patched the same as environment above, ESXi and VCSA 6.7U3. However, I don't have that issue in that smaller environment. I'm now comparing what's different between those 2 environments (installed VIBs on hosts, advanced settings etc.)
Anyone else find this "workaround" to be decidedly sub-optimal? I mean, when the issue is "our QA team didn't catch that log growth on the vCenter SEAT volume is dramatically higher in the new release due to an unmitigated FLOOD of host hardware sensor state messages," the answer can't simply be "well, everyone just turn off WBEM and stop monitoring your host health in vCenter. Problem solved!" or "Just manually truncate tables in the vCenter DB - what could possibly go wrong..." That's like saying that if you're experiencing an issue with datastores filling up, simply turn off capacity alerting in vCenter.

This sensor-state alerting problem is not vendor-specific, so there is no good reason it should not have been discovered before pushing 6.7U3 out the door - or, at a bare minimum, I hope there is now a new checklist item for VMware QA to compare overall log write rates against a baseline when evaluating new build candidates for GA.

I guess as a short-term workaround to keep vCenter up and running I can accept it - I'm truncating the VCSA logs because I actually WANT to receive hardware health alerting in vCenter - but we're now almost 5 weeks past the GA date of 6.7U3, and there does not seem to be any sense of urgency about releasing a real resolution via a host patch. Am I overblowing this whole thing?
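For completeness, for anyone who does decide they need the stopgap: the disable-WBEM workaround amounts to roughly the following on each host (verify against the KB for your build; while WBEM is off, vCenter's hardware health monitoring for that host goes dark, which is exactly my objection):

```shell
# Check the current WBEM/CIM service state on the host
esxcli system wbem get

# Disable WBEM (stops the sfcbd CIM providers that generate the
# hardware sensor state events -- and all hardware health data with them)
esxcli system wbem set --enable false

# Re-enable once a real fix ships
esxcli system wbem set --enable true
```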
Wanted to note that there is updated content in a VMware Knowledge Base article today that, as near as I can tell, gives you a NEW option to create a rule to ignore hardware sensor events from specific hardware sensors only. Oh great - instead of telling me to cut the wires to ALL the warning lights on my car's dashboard, the recommendation is now to just snip the wire to the light that corresponds to where the invalid error is originating. I guess I'm the only one who finds this type of solution totally unacceptable - how about releasing a patch that actually *resolves* the excess hardware sensor alert generation in the first place?
Why hasn't even a single post-U3 patch been released yet? At this point in the patching lifecycle there had already been TWO patches released post-U2, and a third post-U2 patch was only 8 more days away from release. It's now been *62* days since the release of U3 and crickets from VMware. I gotta be honest, I'm in a large enterprise infrastructure and manually truncating my SEAT disk db tables every 4-5 days to avoid vCenter being inaccessible is not sitting well with me at this point. I was willing to do it as a temporary workaround, but TWO MONTHS..... And yes, I still want to receive valid hardware alerts from vCenter - it's not my only alerting mechanism, but it provides important redundancy in the event of a hardware failure on a host.
update-from-esxi6.7-6.7update02 - 04/11/2019 - U2 release day
ESXi670-201904001 - 04/30/2019 - 19 days after U2 release
ESXi670-201905001 - 05/14/2019 - 33 days after U2 release
ESXi670-201906002 - 06/20/2019 - 70 days after U2 release
update-from-esxi6.7-6.7_update03 - 08/20/2019 U3 release day
today 10/21/2019 is *62* days after U3 release date and nothing....
Same boat, crazy it has been this long with no patch release to fix the issue.
I guess it's just not as big of a priority to them as it is to some of us. I don't like running with no hardware alerts, but it was flooding vROps and vCenter, along with filling one of the drives.
It's nice to get a new release of vRealize Suite but I'd rather have a fix that allows me to enable the CIM provider.
Maybe next month as October is almost to a close lol
Although the "Sensor System Chassis 1 UID" still shows as Unknown Status in the Hardware Health monitor for me after upgrade, it appears as though the 'Sensor -1 health events flooding the logs' issue is resolved in patch ESXi-6.7.0-20191104001 released last night.
"After upgrading to ESXi 6.7 Update 3, you might see Sensor -1 type hardware health alarms on ESXi hosts being triggered without an actual problem. This can result in excessive email alerts if you have configured email notifications for hardware sensor state alarms in your vCenter Server system. These mails might cause storage issues in the vCenter Server database if the Stats, Events, Alarms and Tasks (SEAT) directory goes above the 95% threshold."
Not to hijack the thread... but we have built new (clean install) ESXi 6.7 U3b hosts and we don't see the 'Sensor -1 type' errors in the syslog anymore. We do still have the Chassis UID and Sys Health LED listed as Unknown, and we are seeing the message spew below - do you see the same messages? We see 10 attempts every 15 seconds.
sfcb-vmw_ipmi [random number]: IpmiIfcSelGetInfo: IPMI_CMD_GET_SEL_INFO cc=0xc1
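If you want to quantify the spew on a host, something simple like this works from the ESXi shell (assumes the standard /var/log layout; adjust the path if you redirect logs to a scratch volume):

```shell
# Count how many of these IPMI SEL messages are in the current syslog
grep -c 'IpmiIfcSelGetInfo: IPMI_CMD_GET_SEL_INFO' /var/log/syslog.log
```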
I recently experienced this issue on a couple HPE ProLiant DL360 Gen9 servers with errors like the following (Alarm 'Host hardware sensor state' on esx.abcd.com. triggered by event 14306564 'Sensor -1 type , Description Disk 7 on HPSA1 : Port 2I Box 3 Bay 7 : 3576GB : Unconfigured Disk : OK state deassert for . Part Name/Number N/A N/A Manufacturer N/A').
In my case, I had already applied the patch referenced in the VMware Knowledge Base article weeks ago and rebooted the hosts. However, to stop the errors I had to log in to the deprecated Flash vSphere Web Client and reset the host's system event log and sensors.
After upgrading to ESXi 6.7, we got lots of alerts like:
08/19/2020, 6:07:44 AM ... event 3544209 'Sensor 3 type other, System Chassis 1 NMI State 0
08/19/2020, 6:07:44 AM ... event 3544208 'Sensor 177 type other, Group 4 CPU
08/19/2020, 6:07:44 AM ... event 3544207 'Sensor 135 type other, Group 4 PECI Bus
08/19/2020, 6:07:44 AM ... event 3544206 'Sensor 179 type other, Group 2 PCI
08/19/2020, 6:07:44 AM ... event 3544205 'Sensor 178 type other, Group 1 DIMM
08/19/2020, 6:07:44 AM ... event 3544204 'Sensor 25 type power, Power Supply 1 PS 1 Status 0
08/19/2020, 6:07:44 AM ... event 3544203 'Sensor 144 type power, Power Module (DC-to-DC) 10 CPU 1 VRD 0
They are diverse sensor alerts.
From what I've read, this discussion centers on "Sensor -1 type" events; we have various sensor alerts, but no "Sensor -1 type" among them.
Has anyone got a patch for this issue? If so, please share it here. Thanks.
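In case it helps with triage, here is a rough sketch I used to tally which sensors show up most often in exported event text (it assumes alarm descriptions shaped like the lines quoted above, with a "Sensor N type <kind>" prefix; adjust the regex to your export format):

```python
import re
from collections import Counter

# Matches the "Sensor <number> type <kind>" prefix seen in the
# 'Host hardware sensor state' alarm descriptions quoted above.
# The kind can be empty, as in the "Sensor -1 type ," events.
SENSOR_RE = re.compile(r"Sensor (-?\d+) type\s*(\w*)")

def tally_sensor_events(lines):
    """Count events per (sensor number, sensor kind) pair."""
    counts = Counter()
    for line in lines:
        m = SENSOR_RE.search(line)
        if m:
            counts[(int(m.group(1)), m.group(2))] += 1
    return counts

# Example with event lines like the ones in this thread:
events = [
    "event 3544209 'Sensor 3 type other, System Chassis 1 NMI State 0",
    "event 3544204 'Sensor 25 type power, Power Supply 1 PS 1 Status 0",
    "event 3544208 'Sensor 177 type other, Group 4 CPU",
    "event 14306564 'Sensor -1 type , Description Disk 7 ...",
]
for (sensor, kind), n in tally_sensor_events(events).most_common():
    print(f"Sensor {sensor} ({kind or 'unknown'}): {n}")
```

Running it over a full event export makes it obvious whether one bogus sensor is flooding or whether, as in our case, the noise is spread across many sensors.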