Re: Alert Host IPMI System Event Log status.

VMMyke · 02-25-2014

I have a VMware cluster running on a blade server, and all but one of the hosts report this error: Alert Host IPMI System Event Log status. A Google search turned up this fix: VMware KB: The Host IPMI System Event Log Status alarm is triggered repeatedly in VMware vCenter ... In step 4 I get "The system event log is empty." Any help with this would be great.

Resolution

To determine why the log has filled up, investigate the hardware.

To resolve this issue and stop the alarm from triggering repeatedly, clear the IPMI System Event Log and reset the sensors.

To clear the log and reset the sensors:

1. Open vCenter Server using vSphere Client.
2. In the vCenter Inventory, select the ESXi/ESX host.
3. Click the Hardware Status tab.
4. Click System Event log under View.
5. Click Reset Event Log. The red alert is removed from the System Event log.
6. Click Reset Sensors to reset the host sensors.

ESXi 5.1 Update 2 (Build Number: 1483097) and ESXi 5.5 Patch 1 (Build Number: 1474528) introduce a new localcli command to clear the IPMI SEL logs:

localcli hardware ipmi sel clear

To execute this command on the ESXi 5.1 or 5.5 host:

1. Connect to the ESXi host via SSH. For more information, see Using Tech Support Mode in ESXi 4.1 and ESXi 5.x (1017910).
2. Execute this command:

localcli hardware ipmi sel clear
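The KB text above ties the localcli command to specific builds (5.1 Update 2 build 1483097, 5.5 Patch 1 build 1474528), and a later post notes it is not available on 5.0. As a minimal sketch, a POSIX sh function can check the version string before the command is tried. The function name `sel_clear_supported` is hypothetical; it assumes the string looks like the output of `vmware -v` (for example `VMware ESXi 5.5.0 build-1474528`), that build numbers only increase within a release line, and that releases after 5.x keep the command.

```sh
# Hypothetical helper: returns 0 if "localcli hardware ipmi sel clear" should be
# available, 1 if not, 2 if the version string could not be parsed.
# Input is assumed to look like "VMware ESXi 5.5.0 build-1474528".
sel_clear_supported() {
    ver=$(echo "$1" | sed -n 's/.*ESXi \([0-9]*\.[0-9]*\)\..*/\1/p')
    build=$(echo "$1" | sed -n 's/.*build-\([0-9]*\).*/\1/p')
    if [ -z "$ver" ] || [ -z "$build" ]; then
        return 2
    fi
    major=${ver%%.*}
    minor=${ver#*.}
    if [ "$major" -gt 5 ]; then
        return 0    # assumption: later releases include the command
    elif [ "$major" -lt 5 ]; then
        return 1
    fi
    case "$minor" in
        1) [ "$build" -ge 1483097 ] ;;   # ESXi 5.1 Update 2 or later
        5) [ "$build" -ge 1474528 ] ;;   # ESXi 5.5 Patch 1 or later
        *) return 1 ;;                   # e.g. ESXi 5.0: not available
    esac
}

# Example use on the host (not executed here):
#   if sel_clear_supported "$(vmware -v)"; then
#       localcli hardware ipmi sel clear
#   else
#       echo "Use Reset Event Log in the vSphere Client instead"
#   fi
```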

ViennaAustria · 06-29-2014

I cannot help you with that, but I have the very same problem.

It seems to be connected to Dell servers and mostly to Supermicro mainboards. I have never had it on an HP server yet. It began several months ago on a Dell server, and every week or so another machine starts throwing these errors. Currently I see them on about 20 hosts, spread across customers in Austria and Hungary.

I followed the VMware KB article with no success: the system event log is empty, so there is no clear button. The `localcli` command did nothing.

I re-flashed the BMC firmware with the latest versions, reset everything to defaults, and cleared all logs, with no success.

I wonder whether it is connected to the IPMI vulnerabilities that are currently being exploited on many systems I know of.

Vulnerability Note VU#648646 - Supermicro IPMI based on ATEN firmware contain multiple vulnerabiliti...

We reacted by adding ACLs, either in the IPMI firmware itself or on the routers. Since then we have had no more bandwidth issues, but I'm not sure whether we have really secured the IPMI/BMC systems. Maybe the IPMI error in vCenter reflects another hack or hacking attempt?

jeremyahagan · 07-07-2014

Same issue here with 1 server out of a fleet of 70 HP DL360 G8s. The event log is empty, and I am running 5.0, so the localcli command line is not an option.

NLMK_Coating · 07-08-2014

Hello,

Since I updated my ProLiant DL360p Gen8 servers, I have had the same problem.

I have received more than 100 alerts in the last 24 hours from my 3 ESXi hosts ([VMware vCenter - Alarme alarm.SELHealthAlarm]).

I did all the resets advised above, but nothing changed.

Can anyone help us?

PARENT Thomas.

jeremyahagan · 07-08-2014

I've logged a case with VMware Support. I will report back here.

Wh33ly · 07-09-2014

I had the same issue and used some of the tricks above; sometimes they worked and sometimes they didn't.

But I found a trick from someone somewhere: remove the ESX host from the inventory and re-add it. Something seems to get stuck in vCenter or its database that keeps bringing these messages back. Sometimes this was a solution for me.

jeremyahagan · 07-09-2014

I had a remote session with VMware support. I only have 1 affected server so we did a tail on the /var/log/syslog.log file and found the following repeated over and over:

2014-07-08T00:55:34Z sfcb-vmware_raw[3569]: IpmiIfcSelGetAll: no record count
2014-07-08T00:56:12Z sfcb-vmware_raw[3569]: IpmiIfcSelReadEntry: data length mismatch req=19,resp=3
2014-07-08T00:56:15Z sfcb-vmware_raw[3569]: IpmiIfcSelReadEntry:error 203.
2014-07-08T00:56:15Z sfcb-vmware_raw[3569]: IpmiIfcSelReadEntry: retry expired.
2014-07-08T00:56:15Z sfcb-vmware_raw[3569]: IpmiIfcSelReadNew: failed call to IpmiIfcSelReadEntry cc = 0xff

This led to an article that didn't exactly match the symptoms: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=207769...

However, its fix does appear to have resolved the issue: restart the management agents from the console using services.sh (services.sh restart).
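To check whether a host keeps logging the sfcb-vmware_raw errors quoted in this post without tailing the whole syslog, a small sketch like the following groups the IpmiIfcSel* messages and counts each one. It assumes the log format shown above (timestamp, `process[pid]:`, message) and that grep, sed, sort and uniq are available in the ESXi shell; the function name `summarize_sel_errors` is hypothetical.

```sh
# Hypothetical helper: count each distinct IpmiIfcSel* message logged by
# sfcb-vmware_raw in a syslog file (default: /var/log/syslog.log).
# Timestamp and PID are stripped so repeats of the same message are grouped,
# and the most frequent message is printed first.
summarize_sel_errors() {
    log=${1:-/var/log/syslog.log}
    grep 'sfcb-vmware_raw\[[0-9]*\]: IpmiIfcSel' "$log" |
        sed 's/^[^ ]* sfcb-vmware_raw\[[0-9]*\]: //' |
        sort | uniq -c | sort -rn
}

# Example use on the host:
#   summarize_sel_errors
#   summarize_sel_errors /var/log/syslog.log | head -n 5
```

If the counts keep growing after a management-agent restart, the problem is probably still present.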

NLMK_Coating · 07-10-2014

For information, I solved my problem with HP support. It was caused by the server BIOS (DL360p) service pack update. I had to shut down the ESX host, connect to the HP console interface (Intelligent Provisioning), and delete all the logs of the physical server. Then I restarted the physical server. In the vCenter interface, I updated the hardware status and reset the sensors. It is OK for me now. I hope this will be helpful for someone. Thomas.

ViennaAustria · 07-14-2014

After installing the recent ESXi updates, all the IPMI issues vanished. I don't know which of the 7 updates is responsible, but I have had no IPMI errors for several days.

Wh33ly · 07-15-2014

What also works is to log in through SSH.

Clear the IPMI logfiles with:

localcli hardware ipmi sel clear

Restart the hardware status provider

/etc/init.d/sfcbd-watchdog restart

You can also list the logs with:

localcli hardware ipmi sel list

The last time I noticed these messages was after upgrading to the latest HP iLO version.
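The three commands in this post can be chained into one small script for an SSH session on the affected host. This is a sketch, not a tested procedure: the `run` wrapper, the `reset_ipmi_sel` function and the `DRY_RUN` variable are hypothetical additions that only print the commands when `DRY_RUN=1`, so the sequence can be reviewed first. The commands themselves are the ones quoted above.

```sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical function chaining the commands from this post.
reset_ipmi_sel() {
    run localcli hardware ipmi sel list                # show the current SEL entries
    run localcli hardware ipmi sel clear || return 1   # clear the IPMI SEL
    run /etc/init.d/sfcbd-watchdog restart             # restart the hardware status provider
}

# Example use:
#   DRY_RUN=1 reset_ipmi_sel   # only print the commands
#   reset_ipmi_sel             # run them for real
```

Stopping when the clear step fails keeps the CIM restart from hiding that the SEL was not actually cleared.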

venkyVM · 07-15-2014

Hi,

What do you get when you run /bin/enum_instances CIM_RecordLog?

diegoazevedo · 08-06-2014

In my case, vCenter showed that the server had more than 61,000 entries in the log. Using the iDRAC (Dell R710), I could see only 19 entries, which matched the number of entries I saw with localcli hardware ipmi sel list.

Restarting the management agents on the ESXi host did the trick here. vCenter now shows only 19 records in the IPMI SEL event log.

AnikSwamin · 11-07-2014

The unknown IPMI issue was fixed by restarting the Small Footprint CIM Broker daemon (sfcbd):

/etc/init.d/sfcbd-watchdog restart

Alistar · 11-07-2014

I have a PowerCLI script for this; we are facing the same bug on HP blades in our environment.

It does basically everything described above, but without the hassle of connecting to each host manually. It's a little bit modular, so take a look at the code first if you understand it; if not, just run it, enter your vCenter and ESXi host name (don't forget to check the $hosts value: it should contain your domain name!) and wait a moment.

Write-Host "Reset sensors, refresh HW data & their views on an ESXi host`n" -ForegroundColor Green

<# Uncomment this block to connect to the vCenters listed in a text file (one per line)
$vcenterlist = Get-Content .\vcenters.txt
#>

# Define a blank array for the hosts
$hosts = @()

# If no vCenter list was loaded, prompt until a non-empty vCenter name is entered
if ($null -eq $vcenterlist) {
    do {
        $vcenter = (Read-Host "Define a vCenter where the host is located").Replace(' ', '')
        if ([string]::IsNullOrEmpty($vcenter)) {
            Write-Host "The value must not be null. Try again or CTRL+C to break.`n" -ForegroundColor Red
        }
    } until (-not [string]::IsNullOrEmpty($vcenter))
    $vcenterlist = @($vcenter)
}

# ESXi host definition ($input is a PowerShell automatic variable, so use another name)
$hostName = Read-Host "Enter a name of ESXi host where you want to reset the HW sensors"

# Generate the FQDN and store it in the array (replace yourdomain.lab with your own domain)
$hosts += "$($hostName).yourdomain.lab"

# Connect to every vCenter in the list
foreach ($vcenter in $vcenterlist) {
    Write-Host "`nConnecting to vCenter $vcenter..."
    Connect-VIServer $vcenter | Out-Null
}

# The VMHost objects are retrieved with Get-VMHost for further processing
$vmhosts = Get-VMHost -Name $hosts

# Alternatively, process all hosts of the connected vCenter sessions:
#$vmhosts = Get-VMHost

foreach ($vmhost in $vmhosts) {
    try {
        # Restart the CIM Server service (sfcbd-watchdog), which provides the hardware status data
        Write-Host "Restarting CIM Server service on $vmhost"
        Get-VMHostService -VMHost $vmhost | Where-Object { $_.Key -eq "sfcbd-watchdog" } |
            Restart-VMHostService -Confirm:$false | Out-Null
        Start-Sleep -Seconds 15

        Write-Host "Starting to refresh HW info on $vmhost (this can take a while)"
        # Get the host's HealthStatusSystem managed object
        $hv = Get-View $vmhost
        $hss = Get-View $hv.ConfigManager.HealthStatusSystem

        Write-Host "Resetting HW Sensors..."
        $hss.ResetSystemHealthInfo()
        Start-Sleep -Seconds 15

        Write-Host "Refreshing Data..."
        $hss.RefreshHealthStatusSystem()
        Start-Sleep -Seconds 15

        Write-Host "Refreshing Data View..."
        $hss.UpdateViewData()

        Write-Host "Done processing $vmhost." -ForegroundColor Green
    }
    catch {
        Write-Host "There was an error while trying to refresh the hardware data on $vmhost.`nPlease check the ESXi host's Hardware Status tab." -ForegroundColor Red
    }
}

# Disconnect only after all hosts have been processed
Write-Host "Disconnecting from the vCenter Server(s)..."
Disconnect-VIServer -Server $vcenterlist -Confirm:$false

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/

vhenryanchante · 12-11-2014

Hi, I had the same problem too. Here is what you should do; this is how I solved mine, and it is quite logical.

Make sure that the time on your host hardware is correct (change it and save it), that the time on your vCenter Server is correct, and that the ESXi NTP time is properly synchronized...

Then do the following and remove the alarm...

To clear the IPMI System Event.log file and reset the sensors:

1. Open vCenter Server using vSphere Client.
2. In the vCenter Inventory, select the ESXi/ESX host.
3. Click the Hardware Status tab.
4. Click System Event log under View.
5. Click Reset Event Log. The red alert is removed from the System Event log.
6. Click Reset Sensors to reset the host sensors.

Regards, and let me know how it goes...

JahSoldier · 01-08-2015

This is what I saw prior to the spam of IPMI events:

2015-01-06T20:24:10Z watchdog-sensord: '/usr/lib/vmware/bin/sensord ++min=0,max=10' exited after 7092587 seconds 1
2015-01-06T20:24:10Z watchdog-sensord: Executing '/usr/lib/vmware/bin/sensord ++min=0,max=10'
2015-01-06T20:24:12Z watchdog-sensord: '/usr/lib/vmware/bin/sensord ++min=0,max=10' exited after 2 seconds (quick failure 1) 0
2015-01-06T20:24:12Z watchdog-sensord: Executing '/usr/lib/vmware/bin/sensord ++min=0,max=10'
2015-01-06T20:24:15Z watchdog-sensord: '/usr/lib/vmware/bin/sensord ++min=0,max=10' exited after 3 seconds (quick failure 2) 0
2015-01-06T20:24:15Z watchdog-sensord: End '/usr/lib/vmware/bin/sensord ++min=0,max=10', failure limit reached
2015-01-06T20:30:44Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: data length mismatch req=19,resp=15
2015-01-06T20:30:47Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: data length mismatch req=19,resp=3
2015-01-06T20:30:49Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: EntryId mismatch req=0001,resp=0002
2015-01-06T20:30:50Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: EntryId mismatch req=0001,resp=0002
2015-01-06T20:31:14Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: data length mismatch req=19,resp=15
2015-01-06T20:31:15Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: data length mismatch req=19,resp=3
2015-01-06T20:31:18Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: EntryId mismatch req=0001,resp=0002
2015-01-06T20:31:20Z sfcb-vmware_raw[10147]: IpmiIfcSelReadEntry: EntryId mismatch req=0001,resp=0002

Restarting sfcbd-watchdog (/etc/init.d/sfcbd-watchdog restart), as noted above, did the trick.
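In the log above, watchdog-sensord gives up on sensord ("failure limit reached") a few minutes before the sfcb-vmware_raw SEL errors start. A minimal sketch to look for that message in a syslog file follows. The function name `sensord_gave_up` is hypothetical, it assumes the exact message text quoted above, and the link between the two events is only what this one log shows, not an established cause.

```sh
# Hypothetical check: print the watchdog lines where sensord hit its restart
# failure limit; returns 0 (grep's status) if at least one such line exists.
sensord_gave_up() {
    log=${1:-/var/log/syslog.log}
    grep "watchdog-sensord: End '.*sensord.*', failure limit reached" "$log"
}

# Example use on the host, following the fix reported in this post:
#   if sensord_gave_up; then /etc/init.d/sfcbd-watchdog restart; fi
```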

jitla1971 · 02-04-2015

Can sfcbd-watchdog be restarted during production hours without impacting the VMs running on the host?

Or would it be best to migrate the VMs off the host and then do the restart?

Thank you

Alistar · 02-04-2015

Hi,

Restarting this service is non-disruptive, so you can do it at any time.


coolsport00 · 02-25-2015

This is one that's been bugging me for a while. I've just ignored it because I've had other, more 'pressing' issues to deal with. It finally bugged me enough, and things have calmed down around me enough, that I can now focus on resolving this nitpick. What worked for me is what Wh33ly posted: SSH to the problem host(s) and view the events (localcli hardware ipmi sel list); if any show, clear them (localcli hardware ipmi sel clear); then restart the hardware service (/etc/init.d/sfcbd-watchdog restart).

Thanks!
@coolsport00 (Shane)

Wh33ly

tiagodeaviz · 03-01-2015

Thanks for sharing your tip. I had this issue with an IBM x3650 M4 after I upgraded the IMM. Restarting the management agents with services.sh did it.