Re: IPMI / sfcbd-watchdog Freespace problem

mrudloff · ‎08-18-2010

Hiya,

I have a problem with one of our hosts.

/var/log/ipmi/0 fills up in less than a day. The files

sel

sel.raw

are getting massive and I cannot find a reason for it. The only way to get the host to behave seems to be to delete those files and restart sfcbd-watchdog.

I have two questions, actually:

What does the sfcbd-watchdog service do and, more importantly, what is causing those files to grow so massively, and how can I stop it?
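
To put a number on "fills up in less than a day", a rough Python sketch like the one below could sample the directory size twice and report the growth rate. It assumes a Python 3 interpreter that can read the directory; the function names `dir_size_bytes` and `growth_rate` are made up for illustration and are not part of any VMware tooling.

```python
import os
import time


def dir_size_bytes(path):
    """Return the total size in bytes of regular files directly inside path."""
    total = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            total += os.path.getsize(full)
    return total


def growth_rate(path, interval_s=60.0, sleep=time.sleep):
    """Sample the directory size twice and return growth in bytes per second.

    The sleep argument is injectable so the function can be exercised
    without actually waiting.
    """
    before = dir_size_bytes(path)
    sleep(interval_s)
    after = dir_size_bytes(path)
    return (after - before) / interval_s
```

Calling `growth_rate("/var/log/ipmi/0")` would wait one minute and return the growth in bytes per second. Comparing that figure between the affected host and a healthy one should show whether the logging is continuous or comes in bursts.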

admin · ‎08-18-2010

The sfcb daemon runs on the ESX/ESXi server and monitors your server's health status (processor, memory, power supply, and so on). You can see your server's health status under the Configuration tab.

The job of sfcbd-watchdog is to monitor the sfcb daemon: if sfcbd stops, sfcbd-watchdog restarts it.

Can you please attach the sel log output and a screenshot of the health status? Then I may be able to help you with it.

titanlee · ‎08-18-2010

Hiya,

Thank you for your reply.

Here is the hardware status:

The content of the file 'sel' (40 MB of the same content):

And 'sel.raw' (130 MB of the same content):

Edit: sorry, I was logged into the wrong account

abaum · ‎08-18-2010

We've been having problems with these two services: when the watchdog service restarts, it also restarts hostd, which causes the hosts to show up as disconnected in VC for a few minutes. Tech Support just had us disable the sfcbd-watchdog service. I used to have this problem on my HP servers and now I am seeing it on UCS. Looks like CIM/IPMI and VM don't get along.

adam

admin · ‎08-18-2010

Hi,

I saw the screenshot and the sel entries. I suspect there is an issue with the OEM integration.

Have you integrated any OEM components with the ESX build? I ask because the screenshot you pasted shows "Asset tag : To be filed by O.E.M."

The workaround is:

In the same Health Status screen there is a drop-down list. Select "Sensor event log" in it and click on reset event log, which will remove the entries from sel and sel.raw.

Which build have you installed? Is it a fresh install or an upgrade? Please let me know so I can try to reproduce this issue.

mrudloff · ‎08-18-2010

The server runs the latest possible release and it is a fresh install (4.1.0, 260247).

We run several more hosts in the same cluster and this seems to be the only one with this problem.

Even when I click reset, the files immediately start to grow again. As you can see in the attachment, even the date seems to be an odd one.

admin · ‎08-18-2010

I hope your other servers have the same hardware vendor and the same ESX build installed? May I know which hardware vendor you have?

Please compare the BIOS firmware and BMC firmware versions with those of the other servers where you are not facing this issue. If they are not the same, try upgrading the firmware.

mrudloff · ‎08-18-2010

All three servers are identical in every way, including every firmware / BIOS revision.

Motherboard is a Supermicro X8DTN+-F

VirtualEquality · ‎12-15-2010

Hi,

We have exactly the same problem, and also three servers from the same vendor, but only one has this issue. Stopping the watchdog service fixes the problem, but of course it is no real solution.

Have you or somebody else found any solution for the problem?

Thanks in advance.

aaiitsupport · ‎01-19-2011

I am having the same problem with a Supermicro X8SIL-F-O. Has anybody had any luck getting this working? Part of me thinks that it has something to do with the Asset Tag not being set properly. Does anybody know how to change the asset tag?

ds236 · ‎02-09-2011

Count me as yet another with this issue. The machine gets into trouble within 30 minutes of reboot. A reboot allows things to proceed as normal, then we must reboot once again.

ds236 · ‎02-09-2011

From vSphere Client, Configuration, Software/Advanced Settings, I found:

VMKernel.Boot.ipmiEnabled

I rebooted the box, then immediately went in and unchecked this setting, then rebooted again before the sel and sel.raw files again filled up the space.

After reboot, the /var/log/ipmi/0/ directory on that machine is empty.

This isn't a great workaround, as most of the sensor data now can't be monitored by VMware, but it does keep VMware from filling these two log files with the same information over and over and over. I'd really like to see a fix from VMware. It's easy to see that something dumb is going on: it's writing "efef" forever in one log and "System Boot:" over and over in the other. This can't be that hard to fix in software.
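
For anyone who wants to confirm that their logs show the same pattern (one entry repeated endlessly), here is a minimal Python sketch that counts the most frequent lines in a text log. It assumes the file is line-oriented text, which may not hold for sel.raw, and the helper names are hypothetical.

```python
from collections import Counter


def most_repeated_lines(lines, top=3):
    """Return the `top` most common non-empty lines with their counts."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(top)


def summarize_log(path, top=3):
    """Read a text log and report its most repeated lines."""
    # errors="replace" keeps the scan going if the file contains stray bytes.
    with open(path, "r", errors="replace") as handle:
        return most_repeated_lines(handle, top)
```

If a single line accounts for nearly every entry, the log is being flooded with one repeated event rather than recording many distinct ones.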

zero1 · ‎02-11-2011

I am having the exact same issue with a Supermicro X8SIL-F-O. Workaround seems to work but I hope this gets a permanent fix soon.

ds236 · ‎02-18-2011

Filling in the Asset Tag value (and for that matter any other blank values in the FRU) has ZERO impact on this issue. Looks like the issue here is a bug in ESXi as it interacts with this motherboard. I'd sure like to see this fixed, and it'll affect my purchase of vSphere licensing, as we are evaluating the product set now.

Disabling IPMI entirely in the VMware configuration is presently the only way to make these platforms functional.

That VMware fails with cryptic messages about the disk being full, rather than handling the issue properly, gives the entire product an appearance of poor software quality and questionable quality assurance testing. Surely you can do better.

acarrasco201110 · ‎02-28-2011

We have the same problem on one of our six ESX 4.1 servers. It's an IBM x3650 M3.

There is a red alert, and the System event log shows:

01/01/9999 1:00:00 AM OEM Defined:0xefefefefefefefefefefef
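
The entry above carries a payload that is just the byte 0xef repeated, which looks like filler rather than real event data. As a rough illustration only, a Python sketch for flagging such entries might look like this. The line format is assumed from the single example above, and both helper names are hypothetical.

```python
def oem_payload(entry):
    """Extract the text after 'OEM Defined:' from an event line, or None."""
    _, found, rest = entry.partition("OEM Defined:")
    return rest.strip() if found else None


def is_filler_payload(hex_text):
    """True if the hex string decodes to at least two bytes that are all equal."""
    digits = hex_text.lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    try:
        data = bytes.fromhex(digits)
    except ValueError:
        # Odd length or non-hex characters: not a payload we can judge.
        return False
    return len(data) >= 2 and len(set(data)) == 1
```

Running `is_filler_payload(oem_payload(line))` over each line of the event log would separate these repeated-byte entries from ones that carry varied data.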

Any solution?

Thanks in advance.

acarrasco201110 · ‎02-28-2011

At the IBM IMM (ILO) level, we changed the hostname and rebooted the IMM. All the problems have now disappeared.

bert_cl · ‎01-17-2012

The problem is caused by a firmware update (an IMM, UEFI, or DSA update). If the ESX host has not been restarted after those updates, you will get this error.

benny_hauk · ‎04-20-2012

We had the exact same symptom (ESXi 4.1, build 260247; only happened on one system; would disconnect from vCenter temporarily) and VMware tech support told us to disable the CIM agent [link]. The issue went away, but we can no longer view hardware status, driver versions, etc. The problem occurred on an HS22v IBM BladeCenter blade that was up to date firmware-wise.

Has anyone heard which update/patch, if any, fixed this? Also: the VMKernel.Boot.ipmiEnabled fix vs. the disabling-the-CIM-agent fix... which workaround is best? If they both work, does the VMKernel.Boot.ipmiEnabled fix allow some functionality from the Hardware Status tab, or does it disable everything as well?

Benny Hauk, Systems Admin, VCP3/VCP4, LifeWay Christian Resources

ds236 · ‎04-20-2012

We moved to ESXi 5.0, and the problem no longer exists. Someone at VMware knew about it and fixed it. Lobby for it to be patched in 4.1?