VMware Cloud Community
jrmunday
Commander

Exhausting inodes + Disconnected Host

Hi All,

I just had an interesting issue, and thought I would share it as it might save you from co-ordinating planned downtime that could potentially be avoided.

We had a disconnected host where the guest VMs were all still running and could be accessed via RDP, but the host was not responsive through the iLO / DCUI, and SSH was not running and could not be started.

The host logged the following sequential events in vCenter;

The root filesystem's file table is full.  As a result, the file tmp:/auto-backup.1481830/etc/hosts could not be created by the application 'tar'.
The root filesystem's file table is full.  As a result, the file tmp:/auto-backup.1482016/etc/sfcb/repository/root/interop/cim_listenerdestinationcimxml.idx could not be created by the application 'tar'.
The root filesystem's file table is full.  As a result, the file tmp:/auto-backup.1482194/etc/vmware/hostd/vmAutoStart.xml could not be created by the application 'tar'.
The root filesystem's file table is full.  As a result, the file /etc/vmware/esx.conf.LOCK.17554 could not be created by the application 'hostd-worker'.
The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sensor_threshold.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sensor_hysteresis.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full.  As a result, the file /var/run/sfcb/52c25dd2-064a-abee-ce4c-cafd051d527c could not be created by the application 'sfcb-CIMXML-Pro'.
The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sel_header.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full.  As a result, the file /var/run/sfcb/52ca5a12-1d8d-7902-1e14-170d2c282951 could not be created by the application 'sfcb-CIMXML-Pro'.
The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sensor_readings.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full.  As a result, the file /etc/vmware/esx.conf.LOCK.17554 could not be created by the application 'hostd-worker'.
Unable to apply DRS resource settings on host. A general system error occurred: Invalid fault. This can significantly reduce the effectiveness of DRS.
The root filesystem's file table is full.  As a result, the file /var/run/sfcb/523777d0-72dc-9e0b-c6b0-9d32a5255317 could not be created by the application 'sfcb-CIMXML-Pro'.
The root filesystem's file table is full.  As a result, the file /var/run/sfcb/52fc39a4-62d0-866e-50a3-663209c9ca28 could not be created by the application 'sfcb-CIMXML-Pro'.
The vSphere HA availability state of this host has changed to Unreachable
Host is not responding
Alarm 'Host connection state' on myhost.mydomain changed from Green to Red
Alarm 'Host connection state' on myhost.mydomain sent email to myemail@mydomain
vSphere HA agent for this host has an error: The vSphere HA agent is not reachable from vCenter Server
Alarm 'vSphere HA host status' on myhost.mydomain changed from Green to Red
vSphere HA agent for this host has an error: The vSphere HA agent is not reachable from vCenter Server
Cannot scan the host myhost.mydomain because its power state is unknown.
Host is not responding

I found this KB article, but was unable to start the process as I couldn't SSH onto the host;

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203779...

Since I knew which guests were running on the affected host, I contacted the business and arranged emergency downtime to shut these guests down so that I could power cycle the host and deal with the issue. After lots of co-ordination we finally agreed on a suitable time that satisfied all business areas, and started the remediation.

Now here is the interesting part ... within seconds of shutting down guest VMs with a simple for loop and the shutdown command (reading one guest hostname per line from a text file and forcing an immediate remote shutdown of each), the host status changed to Green and it was connected to vCenter again.

for /f %i in (C:\_temp\targets.txt) do shutdown -s -m \\%i -t 0 -f

I enabled SSH and ran "stat -f /" - results below;

~ # stat -f /
  File: "/"
    ID: 1        Namelen: 127     Type: visorfs
Block size: 4096
Blocks: Total: 449852     Free: 324368     Available: 324368
Inodes: Total: 8192       Free: 55

After running through the above-mentioned KB article, the inodes were still exhausted;


/var/run/sfcb # stat -f /
  File: "/"
    ID: 1        Namelen: 127     Type: visorfs
Block size: 4096
Blocks: Total: 449852     Free: 324565     Available: 324565
Inodes: Total: 8192       Free: 122

So now that the host was available again, I put it into maintenance mode, rebooted it, and checked again after the reboot (plenty of free inodes);

~ # stat -f /
  File: "/"
    ID: 1        Namelen: 127     Type: visorfs
Block size: 4096
Blocks: Total: 449852     Free: 332942     Available: 332942
Inodes: Total: 8192       Free: 5721

All VMs that were shut down were then powered back on using PowerCLI.
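
For reference, the power-on was nothing fancy; a minimal sketch, assuming targets.txt holds one VM name per line exactly as it appears in the vCenter inventory;

# Assumes an existing Connect-VIServer session to vCenter
Get-Content C:\_temp\targets.txt | ForEach-Object { Start-VM -VM $_ }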

So the interesting point that could potentially be taken from this is that next time this issue occurs, I might be able to resolve it by shutting down one or more running VMs without affecting all guest VMs ... so perhaps shut down the lowest-priority non-production VMs first to see if this frees up enough inodes to get the host responsive again (see the sketch below).
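
Something along these lines; a hypothetical sketch only, where lowpriority.txt is a placeholder for a list of guest hostnames ordered from lowest to highest priority;

# Use the same forced in-guest shutdown as above (it works even while the
# host is disconnected from vCenter), and stop as soon as vCenter sees the
# host reconnect. Assumes a PowerCLI session to vCenter.
foreach ($guest in Get-Content C:\_temp\lowpriority.txt) {
    shutdown.exe -s -m "\\$guest" -t 0 -f
    Start-Sleep -Seconds 30   # give hostd a moment to recover
    if ((Get-VMHost myhost.mydomain).ConnectionState -eq 'Connected') { break }
}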

So two questions;

  1. Is this logic flawed?
  2. Is there a method to monitor FREE inodes so that this can be caught in advance of it becoming an issue involving downtime?

Cheers, & happy new year!

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
5 Replies
vmroyale
Immortal

Note: Discussion successfully moved from VMware ESXi 5 to Availability: HA & FT

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
jrmunday
Commander

Bumping to the top ... Thanks.

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
depping
Leadership

Typically you should try to clean up certain directories. In most cases you will need to restart the management services before the host comes back in vCenter.

Here's a KB that tells you which directories to look at:

http://kb.vmware.com/kb/2037798

Not sure about monitoring it.

jrmunday
Commander

Hi Duncan,

Just closing this thread and feeding back some information ...

I have (with the help of the PowerCLI community) written a script to monitor this, see thread below;

http://communities.vmware.com/message/2178047#2178047
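
For anyone who lands here later, the approach boils down to something like the following; a minimal sketch only, not the exact script from the thread above. It assumes a current PowerCLI session, and the property names (RamdiskName, MaximumInodes, UsedInodes) are my assumption based on the columns of 'esxcli system visorfs ramdisk list';

# Warn when any visorfs ramdisk on a connected host has used over 90% of its inodes
foreach ($vmhost in (Get-VMHost | Where-Object { $_.ConnectionState -eq 'Connected' })) {
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    $esxcli.system.visorfs.ramdisk.list.Invoke() | ForEach-Object {
        $max  = [int]$_.MaximumInodes
        $used = [int]$_.UsedInodes
        if ($max -gt 0 -and ($used / $max) -gt 0.9) {
            Write-Warning "$($vmhost.Name): ramdisk '$($_.RamdiskName)' is low on inodes ($used of $max used)"
        }
    }
}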

Cheers,

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
depping
Leadership

Awesome, thanks for sharing!
