jasonmc
Contributor

ESX 3.5 and Red Hat EL 5.6 VM Hang on Boot

Hello all,

Has anyone experienced issues with Red Hat EL 5.6 using kernels 2.6.18-238, 2.6.18-238.1.1 and 2.6.18-238.5.1?  We are running into a condition where VMs hang during the initial kernel boot process.  I'm unable to correlate these hangs with any particular ESX-level event; the VMs are running on different hosts and even different clusters.  All of the issues began with the upgrade to EL 5.6 and kernel 2.6.18-238.1.1.el5 and persist in 2.6.18-238.5.1.el5.  This has affected more than 20 hosts at this point, of all different configurations, but always EL 5.6 VMs only.  The symptom is exactly the same every time.  During the initial kernel start, it gets as far as:


PCI: Setting latency timer of device 0000:00:01.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
Simple Boot Flag at 0x36 set to 0x80

The next line on all VMs that boot successfully is:

Using TSC for driving interrupts

However, VMs that hang during boot never reach the "Using TSC..." line.  This leads me to believe that the problem is related to the OS electing to use TSC as the clocksource, and that this is somehow an unstable combination with ESX 3.5 (build 317866) and EL 5.6 VMs.  The issue is sporadic, though, and I can't reproduce it on demand; all I can say is that when a VM fails to boot, it always fails in the same place in the same way.  I've considered moving back to clocksource=acpi_pm divider=10, which was recommended for EL 5.3 and earlier, but I'm hesitant to do that since TSC is clearly a better-performing timekeeper.
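
For anyone who wants to check their own VMs for the same signature, here is a rough Python sketch that classifies a captured boot log. It assumes you have the console output in a text file (for example from a serial console redirected to a file, which is my assumption and not something built in). The function name and the "passed"/"hung"/"unknown" labels are made up for illustration, and the two marker strings are just the lines quoted above, so they may differ on other hardware or kernels.

```python
# Marker strings copied from the boot output quoted above (may vary elsewhere).
LAST_LINE_BEFORE_HANG = "Simple Boot Flag at 0x36 set to 0x80"
TSC_LINE = "Using TSC for driving interrupts"


def classify_boot_log(log_text):
    """Return 'passed', 'hung', or 'unknown' for a captured boot log."""
    # Ignore blank lines so a trailing newline does not hide the last message.
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]

    # The boot got past the suspect point if the TSC line shows up anywhere.
    if any(TSC_LINE in line for line in lines):
        return "passed"

    # The log ends exactly where the hanging VMs stop.
    if lines and LAST_LINE_BEFORE_HANG in lines[-1]:
        return "hung"

    # Anything else is a different failure or an incomplete capture.
    return "unknown"
```

Usage would be something like `classify_boot_log(open("console.log").read())`, where `console.log` is a hypothetical file name.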

Anyone seeing this?  Any resolution?

2 Replies
5thfishie
Contributor

I know this post is a bit old, but I just ran into the same issue when upgrading several RHEL 5.5 VMs to 5.6 running on ESXi 4.1.  I got around the problem by adding the following option to the kernel boot parameters:

clocksource=acpi_pm

Not ideal but it works for the time being.
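
If you have a lot of VMs to touch, something along these lines could generate the edited kernel line for /boot/grub/grub.conf. This is only a sketch: `add_kernel_params` is a hypothetical helper, it assumes a GRUB legacy style `kernel` line, and the root= value in the usage note is a placeholder rather than anything from my systems. It replaces any existing setting for the same parameter and collapses runs of whitespace between arguments into single spaces.

```python
def add_kernel_params(kernel_line, params):
    """Return a GRUB legacy 'kernel' line with the given boot parameters set.

    Existing arguments with the same key (the text before '=') are replaced.
    """
    # Preserve the original leading indentation of the line.
    indent = kernel_line[:len(kernel_line) - len(kernel_line.lstrip())]
    tokens = kernel_line.split()
    if len(tokens) < 2 or tokens[0] != "kernel":
        raise ValueError("not a GRUB legacy kernel line")

    # tokens[0] is the 'kernel' keyword, tokens[1] is the kernel image path.
    head, args = tokens[:2], tokens[2:]
    for param in params:
        key = param.split("=", 1)[0]
        # Drop any earlier setting of this parameter, then append the new one.
        args = [a for a in args if a.split("=", 1)[0] != key]
        args.append(param)
    return indent + " ".join(head + args)
```

For example, `add_kernel_params("kernel /vmlinuz-2.6.18-238.5.1.el5 ro root=/dev/sda1", ["clocksource=acpi_pm"])` appends the option to the end of the line. Back up grub.conf before writing anything back to it.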

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100642...

jasonmc
Contributor

Actually, I've tracked this problem down to a set of patches introduced in the Red Hat EL 5.6 kernel (-238.el5 and later) that attempted to beef up timekeeping in EL5 i386 kernels.  x86_64 kernels are unaffected.  In my tests, setting clocksource= to any value didn't help, because TSC always ended up being chosen for handling IRQs, which is the crux of the problem.  However, I did determine that setting divider=10 dramatically reduced the frequency of hitting the bug, because it cuts the number of timer interrupts tenfold.  My bug report and the subsequent patch appear to have been accepted by Red Hat for inclusion in a future errata. https://bugzilla.redhat.com/show_bug.cgi?id=692966
