5 Replies Latest reply on Jan 18, 2013 1:03 AM by depping

    Exhausting inodes + Disconnected Host

    Jon Munday Master
    vExpert

      Hi All,

       

      I just had an interesting issue and thought I would share it, as it might save you co-ordinating some planned downtime that could potentially be avoided.

       

      We had a disconnected host where the guest VMs were all still running and could be accessed via RDP, but the host was not responsive through the iLO / DCUI, and SSH was not running and could not be started.

       

      The host logged the following sequential events in vCenter:

       

      The root filesystem's file table is full.  As a result, the file tmp:/auto-backup.1481830/etc/hosts could not be created by the application 'tar'.
      The root filesystem's file table is full.  As a result, the file tmp:/auto-backup.1482016/etc/sfcb/repository/root/interop/cim_listenerdestinationcimxml.idx could not be created by the application 'tar'.
      The root filesystem's file table is full.  As a result, the file tmp:/auto-backup.1482194/etc/vmware/hostd/vmAutoStart.xml could not be created by the application 'tar'.
      The root filesystem's file table is full.  As a result, the file /etc/vmware/esx.conf.LOCK.17554 could not be created by the application 'hostd-worker'.
      The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sensor_threshold.raw could not be created by the application 'sfcb-vmware_raw'.
      The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sensor_hysteresis.raw could not be created by the application 'sfcb-vmware_raw'.
      The root filesystem's file table is full.  As a result, the file /var/run/sfcb/52c25dd2-064a-abee-ce4c-cafd051d527c could not be created by the application 'sfcb-CIMXML-Pro'.
      The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sel_header.raw could not be created by the application 'sfcb-vmware_raw'.
      The root filesystem's file table is full.  As a result, the file /var/run/sfcb/52ca5a12-1d8d-7902-1e14-170d2c282951 could not be created by the application 'sfcb-CIMXML-Pro'.
      The root filesystem's file table is full.  As a result, the file /var/log/ipmi/0/.sensor_readings.raw could not be created by the application 'sfcb-vmware_raw'.
      The root filesystem's file table is full.  As a result, the file /etc/vmware/esx.conf.LOCK.17554 could not be created by the application 'hostd-worker'.
      Unable to apply DRS resource settings on host. A general system error occurred: Invalid fault. This can significantly reduce the effectiveness of DRS.
      The root filesystem's file table is full.  As a result, the file /var/run/sfcb/523777d0-72dc-9e0b-c6b0-9d32a5255317 could not be created by the application 'sfcb-CIMXML-Pro'.
      The root filesystem's file table is full.  As a result, the file /var/run/sfcb/52fc39a4-62d0-866e-50a3-663209c9ca28 could not be created by the application 'sfcb-CIMXML-Pro'.
      The vSphere HA availability state of this host has changed to Unreachable
      Host is not responding
      Alarm 'Host connection state' on myhost.mydomain changed from Green to Red
      Alarm 'Host connection state' on myhost.mydomain sent email to myemail@mydomain
      vSphere HA agent for this host has an error: The vSphere HA agent is not reachable from vCenter Server
      Alarm 'vSphere HA host status' on myhost.mydomain changed from Green to Red
      vSphere HA agent for this host has an error: The vSphere HA agent is not reachable from vCenter Server
      Cannot scan the host myhost.mydomain because its power state is unknown.
      Host is not responding

      I found this KB article, but was unable to start the process it describes as I couldn't SSH onto the host:

      http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2037798

       

      Since I knew which guests were running on the affected host, I contacted the business and arranged emergency downtime to shut those guests down so that I could power-cycle the host and deal with the issue. After lots of co-ordination we finally agreed on a time that satisfied all business areas, and started the remediation.

       

      Now here is the interesting part ... within seconds of shutting down guest VMs with a simple for loop and the shutdown command, the host status changed to Green and it was connected to vCenter again.

       

      for /f %i in (C:\_temp\targets.txt) do shutdown -s -m \\%i -t 0 -f
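      For reference, the same bulk shutdown could be scripted in Python. This is just a sketch of the equivalent of the batch one-liner above, not what I actually ran; the target file path and shutdown flags are carried over from it:

```python
import os

def build_shutdown_commands(targets_file):
    """Read one hostname per line from the target list and build the
    equivalent of 'shutdown -s -m \\\\host -t 0 -f' for each."""
    commands = []
    with open(targets_file) as f:
        for line in f:
            host = line.strip()
            if host:
                commands.append(
                    ["shutdown", "-s", "-m", "\\\\" + host, "-t", "0", "-f"]
                )
    return commands

if __name__ == "__main__":
    targets = r"C:\_temp\targets.txt"  # same target list as the batch loop
    if os.path.exists(targets):
        for cmd in build_shutdown_commands(targets):
            # Dry run: print each command. Swap the print for
            # subprocess.run(cmd) to actually shut the guests down.
            print(" ".join(cmd))
```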

       

      I enabled SSH and ran "stat -f /"; the results are below:

       

      ~ # stat -f /
        File: "/"
          ID: 1        Namelen: 127     Type: visorfs
      Block size: 4096
      Blocks: Total: 449852     Free: 324368     Available: 324368
      Inodes: Total: 8192       Free: 55

       

      After running through the steps in the above-mentioned KB article, the inodes were still exhausted:


      /var/run/sfcb # stat -f /
        File: "/"
          ID: 1        Namelen: 127     Type: visorfs
      Block size: 4096
      Blocks: Total: 449852     Free: 324565     Available: 324565
      Inodes: Total: 8192       Free: 122

      So now that the host was available again, I put it into maintenance mode, rebooted it, and checked again after the reboot (plenty of free inodes):

       

      ~ # stat -f /
        File: "/"
          ID: 1        Namelen: 127     Type: visorfs
      Block size: 4096
      Blocks: Total: 449852     Free: 332942     Available: 332942
      Inodes: Total: 8192       Free: 5721
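      To put the three snapshots above in perspective, here is the free-inode percentage at each point, computed from the Inodes totals in the stat output:

```python
# Free-inode percentages from the three "stat -f /" snapshots above
snapshots = [
    ("disconnected",   8192, 55),    # before any cleanup
    ("after KB steps", 8192, 122),
    ("after reboot",   8192, 5721),
]

for label, total, free in snapshots:
    pct = 100.0 * free / total
    print(f"{label}: {free}/{total} free ({pct:.1f}%)")
# → disconnected: 55/8192 free (0.7%)
# → after KB steps: 122/8192 free (1.5%)
# → after reboot: 5721/8192 free (69.8%)
```

      So even after the KB steps, less than 2% of inodes were free, which is why a reboot was still needed.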

       

      All VMs that were shut down were then powered back on using PowerCLI.

       

      So the interesting point that could potentially be taken from this is that the next time this issue occurs, I might be able to resolve it by shutting down one or more running VMs without affecting all guests ... so perhaps shut down the lowest-priority non-production VMs first to see if this frees up enough inodes to get the host responsive again.

       

      So, two questions:

       

      1. Is this logic flawed?
      2. Is there a method to monitor free inodes so that this can be caught in advance of it becoming an issue involving downtime?
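      On question 2, one rough approach (a sketch, assuming SSH is enabled on the hosts and that you would collect the "stat -f /" output yourself, e.g. over SSH or from a scheduled script; the 10% threshold is an arbitrary example value to tune) is to parse the Inodes line and alert when free inodes drop too low:

```python
import re

def free_inode_ratio(stat_output):
    """Parse the 'Inodes: Total: N  Free: M' line from 'stat -f /'
    output and return free/total as a float."""
    m = re.search(r"Inodes:\s*Total:\s*(\d+)\s+Free:\s*(\d+)", stat_output)
    if not m:
        raise ValueError("no Inodes line found in stat output")
    total, free = int(m.group(1)), int(m.group(2))
    return free / total

# Example against the output captured above (55 free of 8192)
sample = """  File: "/"
    ID: 1        Namelen: 127     Type: visorfs
Block size: 4096
Blocks: Total: 449852     Free: 324368     Available: 324368
Inodes: Total: 8192       Free: 55
"""

THRESHOLD = 0.10  # alert below 10% free -- example value, tune to taste
ratio = free_inode_ratio(sample)
if ratio < THRESHOLD:
    print(f"WARNING: only {ratio:.1%} of inodes free")
# → WARNING: only 0.7% of inodes free
```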

       

      Cheers, & happy new year!

      Jon