pwolf
Enthusiast

Hardware status on ESXi 5.0U1 with HP CIM providers needs a sfcbd-watchdog restart for correct functioning

After the sfcbd processes are restarted via sfcbd-watchdog restart, I get a correct and complete hardware status. Over time this status loses all sensors, so that after 1 to 2 days there is no hardware status at all. A restart of sfcbd fixes this for a while, and then the process starts over again.

In the syslog I see a bunch of these messages, which are filling up the log:

2012-10-08T22:33:37Z sfcb-ProviderManager[183541]: SendMsg sending to 1 183541-9 Bad file descriptor
2012-10-08T22:33:37Z sfcbd[183561]: rcvMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:37Z sfcbd[183561]: Error getting provider context from provider manager: 9 (183561)
2012-10-08T22:33:41Z sfcbd[183561]: Error opening socket pair for getProviderContext: Too many open files
2012-10-08T22:33:41Z sfcbd[183561]: Failed to set recv timeout (30) for socket -1. Errno = 9
2012-10-08T22:33:41Z sfcbd[183561]: Failed to set timeout for local socket (e.g. provider)
2012-10-08T22:33:41Z sfcbd[183561]: spGetMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:41Z sfcbd[183561]: rcvMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:41Z sfcb-ProviderManager[183541]: SendMsg sending to 1 183541-9 Bad file descriptor
2012-10-08T22:33:46Z sfcbd[183561]: Error opening socket pair for getProviderContext: Too many open files
2012-10-08T22:33:46Z sfcbd[183561]: Failed to set recv timeout (30) for socket -1. Errno = 9
2012-10-08T22:33:46Z sfcbd[183561]: Failed to set timeout for local socket (e.g. provider)
2012-10-08T22:33:46Z sfcbd[183561]: spGetMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:46Z sfcbd[183561]: rcvMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:46Z sfcb-ProviderManager[183541]: SendMsg sending to 1 183541-9 Bad file descriptor
2012-10-08T22:33:46Z sfcbd[183561]: Error getting provider context from provider manager: 9 (183561)
2012-10-08T22:33:46Z sfcbd[183561]: Error opening socket pair for getProviderContext: Too many open files
2012-10-08T22:33:46Z sfcbd[183561]: Failed to set recv timeout (30) for socket -1. Errno = 9
2012-10-08T22:33:46Z sfcbd[183561]: Failed to set timeout for local socket (e.g. provider)
2012-10-08T22:33:46Z sfcbd[183561]: spGetMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:46Z sfcb-ProviderManager[183541]: SendMsg sending to 1 183541-9 Bad file descriptor
2012-10-08T22:33:46Z sfcbd[183561]: rcvMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:46Z sfcbd[183561]: Error getting provider context from provider manager: 9 (183561)
2012-10-08T22:33:50Z sfcbd[183561]: Error opening socket pair for getProviderContext: Too many open files
2012-10-08T22:33:50Z sfcbd[183561]: Failed to set recv timeout (30) for socket -1. Errno = 9
2012-10-08T22:33:50Z sfcbd[183561]: Failed to set timeout for local socket (e.g. provider)
2012-10-08T22:33:50Z sfcbd[183561]: spGetMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:50Z sfcbd[183561]: rcvMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:50Z sfcb-ProviderManager[183541]: SendMsg sending to 1 183541-9 Bad file descriptor
2012-10-08T22:33:53Z sfcbd[183561]: Error opening socket pair for getProviderContext: Too many open files
2012-10-08T22:33:53Z sfcbd[183561]: Failed to set recv timeout (30) for socket -1. Errno = 9
2012-10-08T22:33:53Z sfcbd[183561]: Failed to set timeout for local socket (e.g. provider)
2012-10-08T22:33:53Z sfcbd[183561]: spGetMsg receiving from -1 183561-9 Bad file descriptor
2012-10-08T22:33:53Z sfcbd[183561]: rcvMsg receiving from -1 183561-9 Bad file descriptor

This starts a few minutes after sfcbd is restarted and continues until the next restart.

How can I find the offending CIM provider, if that is the cause? Or is this simply a problem of too many open files?
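
A rough first step toward narrowing this down could be to tally the syslog messages per process tag and see which sfcb process is logging the most. The sketch below assumes the syslog line layout shown above (timestamp, then `tag[pid]:`); the function name `tally_by_tag` is made up for illustration. It only shows who is logging, not necessarily which provider is at fault.

```shell
# Tally syslog lines per process tag (e.g. "sfcbd", "sfcb-ProviderManager").
# Reads lines of the form "<timestamp> <tag>[<pid>]: <message>" on stdin.
tally_by_tag() {
    awk '{
        tag = $2
        sub(/\[[0-9]+\]:$/, "", tag)   # drop a trailing "[pid]:"
        sub(/:$/, "", tag)             # tags without a pid just end in ":"
        count[tag]++
    }
    END { for (t in count) print count[t], t }' | sort -n
}
```

It could be run as `tally_by_tag < /scratch/log/syslog.log`, assuming the log lives there on your host.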

I am running ESXi 5.0.0 build 821926 with updates up to ESXi500-201209403BG, HP ESXi 5.0 Management Bundle 1.3-12 and hpnmi for ESXi 5.0 bundle 2.1-2.

Apart from this the server has no other issues, but it is really more than annoying.

Regards

15 Replies
MKguy
Virtuoso

I'm running HP CIM bundle 1.2 and ESXi without the patches released two weeks ago on DL G5 and G7 servers and haven't encountered your issue (they ran for several weeks). I have just updated a test system to CIM bundle 1.3 and the patches.

Has it only occurred since you updated the HP bundle or the host?

What I did notice, both before and after updating, is that the sfcb processes have a lot of open files (which matches your "Too many open files" message).

sfcb related processes account for roughly half of total opened files on all my hosts, with a total of 3300-3800 open files:

# lsof -nPV | wc -l
3353

# lsof -nPV | awk '{count[$2]++} END {for (i in count) print count[i], i}' | sort -n
[...]
83 vpxa-worker
85 vobd
93 vmsyslogd
101 vmx
106 sfcb-ProviderMa
107 sfcb-HTTP-Daemo
107 sfcb-HTTPS-Daem
112 sfcb-vmware_aux
124 sfcb-vmware_raw
129 sfcb-hhrc
130 sfcb-smx-intero
136 sfcb-vmware_int
175 sfcb-vmware_bas
177 sfcb-pycim
231 sfcb-smx
255 hostd-worker
323 sh

Can you compare that to your hosts with the issue?

This may only be resolvable by HP, who provide the CIM provider bundle.

-- http://alpacapowered.wordpress.com
pwolf
Enthusiast

Your picture of open file usage is similar to mine: most of the open files belong to sfcb processes, see below:

~ # lsof -nPV | wc -l
5084

~ # lsof -nPV | awk '{count[$2]++} END {for (i in count) print count[i], i}' | sort -n

[...]

85 vobd
88 vmsyslogd
169 sfcb-ProviderMa
171 sfcb-HTTP-Daemo
171 sfcb-HTTPS-Daem
176 sfcb-vmware_aux
188 sfcb-vmware_raw
193 sfcb-hhrc
194 sfcb-vmware_int
197 sfcb-smx-intero
240 sh
242 sfcb-pycim
249 sfcb-vmware_bas
289 hostd-worker
462 sfcb-smx
573 sfcb-CIMXML-Pro
613 vmx

It happened after upgrading the host and the CIM providers, but that was a direct upgrade from 4.x to 5.0U1. Maybe some files are missing in the latest bundle.

BTW, I tested this on my SLES 11 virtual machines: there is only one sfcbd process, but it has far more files open than any other process, and sfcbd is sometimes troublesome on those machines, too. So the number of open files seems to be a problem of sfcbd itself and nothing specific to the VMware sfcbd.

Do you think it could help to uninstall the HP bundle and reinstall it?

MKguy
Virtuoso

To be honest, I have no idea whether that would help, but you could give it a try.

Other than that, the only workaround I can think of for your issue is scheduling a restart of the service via a local cron script, or remotely via PowerCLI or something similar.

-- http://alpacapowered.wordpress.com
pwolf
Enthusiast

Reinstall did not help.

In the meantime I have removed hp-smx-provider; I now query the array via the CLI and use the built-in providers for other hardware status information.

To get things working I will probably have to dig into the registration files in /var/lib/sfcb/ and disable some probe/sensor, but I have absolutely no idea which one.

I guess it may have to do with the non-HP Ethernet cards or an HP SAS HBA card (not used for storage), which was supported up to VMware 4 but no longer is.

Unfortunately, one cannot stick with one version of ESXi for the physical lifespan of a server if some of the guests run Linux, because support for newer kernels is not backported to older VMware versions. In the last 3 years VMware went from ESX 3.5 to ESXi 5.1, that is 5 versions of the software in 3 years, and support for older hardware is reduced with each ESXi version switch. Over the same period Linux kernels went, e.g. in the SLES distribution, from 2.6.16 to 3.0.38.

As one is not always at the leading edge of hardware and software, support for new OS kernels in older ESX/ESXi versions should be extended, and/or support for older hardware should be ported to newer versions of ESXi. Otherwise we will either have to shorten our hardware renewal cycle, which can get quite expensive, or look for alternatives, which I would not like to do.

MKguy
Virtuoso

I might have just experienced your issue firsthand too, though I couldn't find any log entries similar to yours. The only relevant entries I found were these in syslog.log:

# grep -i sfcbd /scratch/log/*.log

[...]

syslog.log:2012-10-12T13:50:01Z sfcb-vmware_base[3504]: Timeout (or other socket error) sending request to provider
syslog.log:2012-10-12T13:50:01Z sfcb-vmware_base[3504]: Request Header Id (2303) != Response Header reqId (0) in request to provider 429 in process 2. Drop response.
syslog.log:2012-10-12T13:50:01Z sfcb-vmware_base[3504]: Dropped response operation details -- nameSpace: root/cimv2, className: VMware_StorageVolume, Type: 0
syslog.log:2012-10-12T13:50:33Z sfcbd[3512]: Timeout (or other socket error) sending request to provider
syslog.log:2012-10-12T13:50:33Z sfcbd[3512]: Request Header Id (678) != Response Header reqId (0) in request to provider 92 in process 2. Drop response.
syslog.log:2012-10-12T13:50:33Z sfcbd[3512]: Dropped response operation details -- nameSpace: root/hpq, className: SMX_AutoStartEthernetPort, Type: 0
syslog.log:2012-10-12T13:51:03Z sfcb-vmware_base[3504]: Timeout (or other socket error) sending request to provider
syslog.log:2012-10-12T13:51:03Z sfcb-vmware_base[3504]: Request Header Id (2313) != Response Header reqId (0) in request to provider 410 in process 2. Drop response.
syslog.log:2012-10-12T13:51:03Z sfcb-vmware_base[3504]: Dropped response operation details -- nameSpace: root/cimv2, className: VMware_StorageExtent, Type: 0
syslog.log:2012-10-12T13:51:34Z sfcbd[3512]: Timeout (or other socket error) sending request to provider
syslog.log:2012-10-12T13:51:34Z sfcbd[3512]: Request Header Id (679) != Response Header reqId (0) in request to provider 228 in process 2. Drop response.
syslog.log:2012-10-12T13:51:34Z sfcbd[3512]: Dropped response operation details -- nameSpace: root/hpq, className: SMX_AutoStartPCI, Type: 0
syslog.log:2012-10-12T13:51:44Z sfcb-vmware_int[3481]: Problem processing indication to http://127.0.0.1:49152. sfcb rc: 4 CURL error: 28 (Timeout was reached)
syslog.log:2012-10-12T13:55:20Z sfcbd-watchdog: failed query: /bin/prop_of_instances root/cimv2 DELL_EqlHostConnectionManager Status
syslog.log:2012-10-12T13:55:49Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T13:56:19Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T13:56:50Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T13:57:28Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T13:58:00Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T13:58:32Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T13:59:04Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T13:59:38Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:00:10Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:00:42Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:01:14Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:01:34Z sfcb-CIMXML-Processor[17760]: Timeout (or other socket error) waiting for response from provider
syslog.log:2012-10-12T14:02:05Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:02:35Z sfcb-CIMXML-Processor[18378]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:03:18Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:03:48Z sfcb-vmware_int[3481]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:04:18Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:04:50Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:05:20Z sfcb-CIMXML-Processor[18494]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:05:52Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:06:02Z sfcbd-watchdog: failed query: /bin/prop_of_instances root/cimv2 DELL_EqlHostConnectionManager Status
syslog.log:2012-10-12T14:06:22Z sfcb-CIMXML-Processor[18574]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:06:22Z sfcb-CIMXML-Processor[18574]: SFCB restart requested.
syslog.log:2012-10-12T14:06:52Z sfcb-CIMXML-Processor[18494]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:07:02Z sfcbd-watchdog: sfcb restart due to restart request
syslog.log:2012-10-12T14:07:02Z sfcbd-watchdog: stopping sfcbd pid
syslog.log:2012-10-12T14:07:02Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:07:22Z sfcbd[3512]: --- spSendReq/spSendMsg failed to send on 7 (-1)

syslog.log:2012-10-12T14:07:22Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:07:43Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:08:03Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:08:23Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:08:43Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:09:03Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:09:23Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:09:43Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:10:03Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:10:23Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:10:44Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:11:04Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:11:24Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:11:44Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:12:04Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:12:24Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:12:44Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:12:44Z sfcb-CIMXML-Processor[18494]: SendMsg sending to 7 18494-4 Interrupted system call
syslog.log:2012-10-12T14:12:44Z sfcb-CIMXML-Processor[18494]: spSendMsg sending to 7 18494-4 Interrupted system call
syslog.log:2012-10-12T14:12:44Z sfcb-CIMXML-Processor[18494]: --- spSendReq/spSendMsg failed to send on 7 (-1)
syslog.log:2012-10-12T14:12:44Z sfcb-CIMXML-Processor[18494]: SFCB restart requested.
syslog.log:2012-10-12T14:13:04Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:13:24Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:13:45Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:14:05Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:14:25Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:14:45Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:15:05Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:15:25Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:15:45Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:16:05Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:16:26Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:16:46Z sfcbd: Sending TERM signal to sfcbd
syslog.log:2012-10-12T14:17:06Z sfcbd: Stopping sfcbd
syslog.log:2012-10-12T14:17:06Z sfcb-ProviderManager[3477]: SendMsg sending to 72 3477-32 Broken pipe
syslog.log:2012-10-12T14:17:26Z sfcbd-watchdog: Reached max kill attemtps. watchdog is exiting
syslog.log:2012-10-15T08:50:13Z sfcbd-watchdog: Terminating watchdog process with PID 3175
(my manual restart)
syslog.log:2012-10-15T08:50:13Z sfcbd-watchdog: stopping sfcbd pid
syslog.log:2012-10-15T08:50:13Z sfcbd: Stopping sfcbd
syslog.log:2012-10-15T08:50:13Z sfcbd-watchdog: Watchdog active: interval 60 seconds, pid 45898
syslog.log:2012-10-15T08:50:13Z sfcbd-watchdog: starting sfcbd
syslog.log:2012-10-15T08:50:13Z sfcbd: Starting sfcbd
syslog.log:2012-10-15T08:50:14Z sfcb-sfcb[46083]: --- Log syslog level: 3
syslog.log:2012-10-15T08:50:19Z cimslp: --- Using /etc/sfcb/sfcb.cfg

[...]

The host I previously updated to the recent CIM providers and ESXi patches didn't run many VMs, so it didn't have as many vmx processes and total open files (lsof | wc -l yielding 5500 total instead of ~3500 now).

I also noticed right after installing the updates that the "Storage" section was missing from the hardware status tab and the "Software Components" section didn't show the P410i Smart Array firmware version. It works now, though.

I'll update another host with only the ESXi patches but not the 1.3 CIM bundle now and see how that fares.

Edit 2012/10/22:

It seems to have been some other one-time random issue. I have updated most hosts to the recent bundle and ESXi patches and haven't run into any problems for about a week.

-- http://alpacapowered.wordpress.com
RedShift2
Contributor

I have the same problem: sooner or later sfcbd fails to respond. I'm pretty sure it's the HP software that's buggy, leaving file descriptors open or leaking memory, etc.

The problem is easily reproducible if you query sfcbd with wbemcli a few times in quick succession. It stops responding after about 10 queries.

I'm running ESXi 5.1 here, also using the latest HP management agents (installed using the HP ESXi customized ISO), etc...

I suspect sfcbd is hitting the open file limit:

/var/log # ulimit -Hn
1024
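
To check that suspicion, something like the following sketch could flag processes that are getting close to the limit. It assumes lsof output with the process name in column 2 (as in the awk one-liners earlier in the thread); the helper name `near_fd_limit` and the 80% default threshold are my own choices. Note that lsof groups by name, so several processes sharing a name are counted together, and the limit reported by `ulimit -Hn` in your shell is not necessarily the one sfcbd runs with.

```shell
# Print "<count> <name>" for every process name whose number of open files
# is at or above PCT percent of LIMIT. Reads lsof-style lines on stdin,
# with the process name in column 2.
near_fd_limit() {
    limit=$1
    pct=${2:-80}          # default threshold: 80% of the limit
    awk -v limit="$limit" -v pct="$pct" '
        { count[$2]++ }
        END {
            for (name in count)
                if (count[name] * 100 >= limit * pct)
                    print count[name], name
        }' | sort -n
}
```

For example: `lsof -nPV | near_fd_limit 1024 50`.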

pwolf
Enthusiast

Yes, it seems fairly certain that the error lies in the HP software. Whether it occurs or not apparently depends on the physical hardware of the host machine.

But as it is quite complicated to change the monitoring settings of this software, the offending piece could only be identified on a test machine: physically remove everything that is not needed to run the ESXi host software, then add piece after piece until the failure occurs. I do not have an identical physical test machine and cannot afford enough downtime to test this way, so I simply run this host without the HP providers. This is annoying, because I have to check the array status via the CLI, but it is still better than a self-destructing sfcbd.

JoVa
Contributor

Hello,

I'm also having the same problem.

[image: pastedImage_0.png]

I have 3 HP ML350 G8 servers running HP-branded ESXi 5.1. When I check the hardware status via the web interface (vSphere Web Client), one server is not responding. All servers have identical software and upgrades installed.

I stopped and restarted the watchdog via the command line, but no luck.

In the syslog of the server I can see an error about a query that times out after a while:

[image: pastedImage_1.png]

Strangely, there's some info about DELL, although I have HP servers. 😉

The only difference is that on the other two servers multiple hosts are connected, and on the faulty server only one host is connected.

I need to resolve this quickly for a demonstration during a training session.

Tonight I'm rebooting the server to see if this resolves anything (the server has been running for up to 300 days now).

JoVa
Contributor

Finally rebooted my system.

For now this seems to have done the trick.

virtualeric
Contributor

I ran into the same problem as TS after installing the HP ISO of the ESXi 5.5 hypervisor on a ProLiant DL360 G5. Health status is OK for a short time, but then storage disappears and log errors (sfcbd: too many open files) appear.

I tried adding a cron job that restarts the sfcbd process every half hour. This worked satisfactorily, but I didn't succeed in making it reboot-proof; it looks like ESXi 5.5 does not allow that. Since I am new to both ESXi and HP hardware, I am pretty much out of options for now.

Edit: I finally managed to implement the half-hour restart of the sfcbd-watchdog as desired, so for now my problem is solved. I have now put all script code to handle things in /etc/rc.local.d/local.sh instead of /etc/rc.local.sh and this file is not overwritten during reboot.

Macomar
Contributor

Hello virtualeric,

I have the same problem with my DL380 servers (G5/G6/G7). We monitor the servers with Nagios, and sometimes the health status changes to "unknown".

So we manually watch syslog.log and restart the watchdog service from the console.

Would you please give me your script? I think you set up a cron job on the ESX host to run it perhaps once a day, right?

Would it be possible for the script to check the status of the watchdog service and restart it when it's down?

Thank you very much

Macomar

virtualeric
Contributor

Hi Macomar,

This is what /etc/rc.local.d/local.sh looks like to restart the service every 30 minutes.

# script that is not part of a stable API (relying on files to be in
# specific places, specific tools, specific output, etc) there is a
# possibility you will end up with a broken system after patching or
# upgrading.  Changes are not supported unless under direction of
# VMware support.

# Write the restart script (recreated on every boot).
echo "chkconfig sfcbd-watchdog off" > /usr/sbin/sfcbd-restart.sh
echo "chkconfig sfcbd off" >> /usr/sbin/sfcbd-restart.sh
echo "/etc/init.d/sfcbd-watchdog stop" >> /usr/sbin/sfcbd-restart.sh
echo "chkconfig sfcbd-watchdog on" >> /usr/sbin/sfcbd-restart.sh
echo "chkconfig sfcbd on" >> /usr/sbin/sfcbd-restart.sh
echo "/etc/init.d/sfcbd-watchdog start" >> /usr/sbin/sfcbd-restart.sh

chmod 755 /usr/sbin/sfcbd-restart.sh

# Stop crond before editing root's crontab.
kill $(cat /var/run/crond.pid)

# Append the half-hourly restart job to root's crontab.
cat /var/spool/cron/crontabs/root > /var/spool/cron/crontabs/rootx
echo "*/30 * * * * /usr/sbin/sfcbd-restart.sh" >> /var/spool/cron/crontabs/rootx
rm /var/spool/cron/crontabs/root
cp /var/spool/cron/crontabs/rootx /var/spool/cron/crontabs/root

# Restore the crontab permissions and start crond again.
chmod 1444 /var/spool/cron/crontabs/root
crond

exit 0

The script has worked well since November; restarting the service every half hour causes no problems. Checking whether the service actually needs restarting is, if at all possible, beyond my skill level.
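
For anyone who wants to try such a check, one possible approach (a minimal sketch, not something I run in production) is to look in the syslog for the "Too many open files" error that sfcbd logs, counting only errors that appear after the most recent "Starting sfcbd" line. The function name `sfcbd_needs_restart` is made up, and the two message patterns are taken from the log excerpts earlier in this thread, so they may differ on other builds.

```shell
# Succeed (exit status 0) if LOGFILE contains an sfcbd "Too many open files"
# error logged after the most recent "Starting sfcbd" line, fail otherwise.
sfcbd_needs_restart() {
    awk '
        /sfcbd: Starting sfcbd/ { found = 0 }   # errors before the last start no longer count
        /sfcbd\[[0-9]+\]: .*Too many open files/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

The cron job could then run `sfcbd_needs_restart /scratch/log/syslog.log && /etc/init.d/sfcbd-watchdog restart` instead of restarting unconditionally. If the log has been rotated since the last start, older errors in the current file would still trigger one restart.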

hth,

Eric

RGE
Contributor

Hi,

I have similar problems with HP 360 G6 servers running ESXi 5.5.0 174018 (VMware ESXi 5.5 U1 Installable HP Customized ISO Image), but with the latest versions the sensors work fine and the CIM server doesn't crash as long as DirectPath I/O isn't used.

As soon as 1 or 2 network cards are set up for DirectPath I/O, the CIM server crashes regularly and doesn't restart at the first try...

Does anybody have any new information about this problem?

Regards,

Raphael

Macomar
Contributor

I am on vacation from 18/08/2014 to 05/09/2014.

My emails will not be forwarded or read.

Please contact my colleagues at the ISD OnSite Team BASF, phone 0621 60-79185,

or by email at ONSITESERVICEBASF@SD.ISD.DE

BrianDVS
Contributor

Is there a log that I can view that will give me an indication that the service has been stopped and restarted?
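
Judging by the syslog.log excerpts earlier in this thread, stop and start events seem to be logged there by sfcbd-watchdog and sfcbd. A small sketch to filter them out (the function name `sfcbd_lifecycle_events` is made up, and the message patterns are copied from those excerpts, so other builds may word them differently):

```shell
# Keep only the syslog lines that record sfcbd / sfcbd-watchdog being
# stopped, started or restarted. Reads the log on stdin.
sfcbd_lifecycle_events() {
    grep -E 'sfcbd-watchdog: (starting|stopping|sfcb restart|Terminating|Watchdog active)|sfcbd: (Starting|Stopping) sfcbd'
}
```

For example: `sfcbd_lifecycle_events < /scratch/log/syslog.log`.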
