2 Replies Latest reply on Apr 11, 2011 6:48 PM by jasonmc

    ESX 3.5 and Red Hat EL 5.6 VM Hang on Boot

    jasonmc Lurker

      Hello all,


      Has anyone experienced issues with Red Hat EL 5.6 using kernels 2.6.18-238, 2.6.18-238.1.1 and 2.6.18-238.5.1?  We are running into a condition where VMs hang during the initial kernel boot process.  I'm unable to correlate these hangs with any particular ESX-level event; the VMs are running on different hosts and even in different clusters.  All of the issues began with the upgrade to EL 5.6 and kernel 2.6.18-238.1.1.el5 and persist in 2.6.18-238.5.1.el5.  This has affected more than 20 hosts at this point, of all different configurations, but always EL 5.6 VMs only.  The symptom is exactly the same each time.  During the initial kernel start, it gets as far as:

      PCI: Setting latency timer of device 0000:00:01.0 to 64
      NET: Registered protocol family 2
      IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
      TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
      TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
      TCP: Hash tables configured (established 131072 bind 65536)
      TCP reno registered
      Simple Boot Flag at 0x36 set to 0x80


      The next line on all VMs that boot successfully is:


      Using TSC for driving interrupts


      However, VMs that hang during boot never reach the "Using TSC..." line.  This leads me to believe that the problem is related to the OS electing to use TSC as the clocksource, and that this is somehow an unstable combination with ESX 3.5 (build 317866) and EL 5.6 VMs.  However, the issue is sporadic and I can't reproduce it on demand; all I can say is that when a VM fails to boot, it always fails in the same place in the same way.  I've considered moving back to clocksource=acpi_pm divider=10, which was recommended for EL 5.3 and earlier, but I'm hesitant to do that since TSC is clearly a better-performing timekeeper.
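
      In case it helps anyone compare notes, here is a minimal sketch of how saved console output could be checked for this signature.  It assumes the console text of each VM has been captured to a string somehow (for example via a serial port redirected to a file); the function name and the result labels are made up for illustration and are not part of any VMware or Red Hat tool.  It only looks for the two lines quoted above.

```python
# Hypothetical helper: classify one captured console log by the signature
# described in this post. The marker strings are copied from the boot output
# quoted above; everything else is an illustrative assumption.

LAST_LINE_BEFORE_HANG = "Simple Boot Flag at"
FIRST_LINE_AFTER_HANG = "Using TSC for driving interrupts"


def classify_boot_log(text):
    """Return 'ok', 'hung-before-tsc', or 'unknown' for one console log."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    saw_boot_flag = any(ln.startswith(LAST_LINE_BEFORE_HANG) for ln in lines)
    saw_tsc = any(FIRST_LINE_AFTER_HANG in ln for ln in lines)

    if saw_tsc:
        # Boot got past the point where the affected VMs stop.
        return "ok"
    if saw_boot_flag:
        # Reached the last line seen on hung VMs but never the next one.
        return "hung-before-tsc"
    # Log is truncated earlier or comes from a different kernel.
    return "unknown"


if __name__ == "__main__":
    sample = "TCP reno registered\nSimple Boot Flag at 0x36 set to 0x80\n"
    print(classify_boot_log(sample))
```

      A log that ends at the "Simple Boot Flag" line is reported as hung-before-tsc, while one that goes on to the "Using TSC..." line is reported as ok.  This only identifies the symptom; it says nothing about the cause.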


      Anyone seeing this?  Any resolution?