6 Replies Latest reply on Jun 7, 2007 6:05 AM by xenzen

    Vmware-serverd unresponsive and vm's stop running

    dotcom Novice

      Hey there!

       

      I have a dual Xeon machine with an Intel motherboard, 4.5 GB of RAM, and 1 TB of disk on a 3ware drive array, running Ubuntu LTS.

       

      The hardware is proven and solid, having been in production as an email server for a couple of years; we just migrated email to a newer server...

       

      I was running VMware Server 1.0.2 and was having a problem, so yesterday I upgraded to 1.0.3, and I am having the same problem.

       

      The host OS is running fine; however, my Windows server console is unable to connect to the server, and the web-based console serves pages but is also unable to log in.  The web-based console reports this error:

      Unexpected response from vmware-authd: 510 Could not create lock for vmware-serverd

       

      I assume that vmware-serverd is simply not responding.

       

      I get a similar result from vmware-cmd:

      root@ubuntu2z280:/root# vmware-cmd -l

      /usr/bin/vmware-cmd: Could not connect to vmware-authd

        (VMControl error -14: Unexpected response from vmware-authd: 510 Could not create lock for vmware-serverd)

       

      The virtual machines are unresponsive.  ps shows that all of the expected processes exist, but none are consuming the expected CPU.  There is plenty of free memory; the configured virtual machines only consume 25% of the available memory.  The server is on private addresses and can access the net only via NAT; the vm's are on public addresses, running Ubuntu LTS with all updates applied.

       

      As far as I can tell, I'm not doing anything stressful.  The vm's are all clones of each other; 3 out of 4 are running apache2, serving a 2-byte index.html file.  The 4th machine is running cacti and nagios: cacti is scanning 4 SNMP devices, and nagios is scanning 140 hosts with pings.

       

      Everything is set up with out-of-the-box settings, with very little custom configuration... the vm's are all running vmware tools.

       

      Does anyone have any idea what is causing my problems?

       

      The server typically will run fine for one or two days, then the vm's stop responding at the same time I lose access to the console.  A reboot always fixes it, though the shutdown is not graceful for the vm's as far as I can tell.

       

      Obviously the host OS is healthy, as I can ssh into the machine, and run utilities such as ps and top with no problem...  I've been working around unix machines for 20 years, so the only "new" ingredients are vmware and ubuntu.

       

      When it works, it works well.

       

      The one other variable is that the ethernet switch is not on UPS power, so as storms go past, the ethernet has cycled several times... but the power IS protected by a chain of three large GE subpanel surge suppressors, and surge suppression on the power strip...  I mention all of this just in case losing the ethernet is a potential cause of lockup of the vmware code.

      Of course my ssh sessions into the server are through the same ethernet port that is used for all other purposes... so the ethernet port itself does work.

       

      I see vmware-rtc and httpd-vmware both consume small amounts of CPU from time to time, and over a half hour period I see that vmware-serverd consumed .04 seconds of CPU...  So it appears to be alive.  None of the vmware-vmx processes have consumed any CPU, though the cacti/nagios vm should be consuming a fair amount of CPU.  In the last 24 hours the vmx processes consumed 2 hours of CPU total; in the last hour, none.
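A minimal sketch of how the CPU observation above could be made repeatable, assuming a Linux host with /proc mounted. It reads the utime and stime fields from /proc/<pid>/stat for processes whose command name starts with "vmware", so two snapshots taken some minutes apart show which processes accumulated no CPU time in between. The function names and the "vmware" prefix filter are illustrative choices, not part of any VMware tool.

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second=100):
    """Return (command, cpu_seconds) parsed from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split around the last ')' rather than on whitespace alone.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    command = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 1:].split()
    # fields[0] is the process state (field 3 of the file); utime and stime
    # are fields 14 and 15 of the file, which are indexes 11 and 12 here.
    utime, stime = int(fields[11]), int(fields[12])
    return command, (utime + stime) / ticks_per_second


def snapshot(name_prefix="vmware"):
    """Map pid -> (command, cpu_seconds) for matching processes."""
    ticks = os.sysconf("SC_CLK_TCK")
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                command, seconds = cpu_seconds_from_stat(handle.read(), ticks)
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if command.startswith(name_prefix):
            result[int(entry)] = (command, seconds)
    return result


def stalled(before, after):
    """Return pids present in both snapshots whose CPU time did not change."""
    return sorted(
        pid for pid in before
        if pid in after and after[pid][1] == before[pid][1]
    )
```

Calling snapshot() twice with a pause in between and passing both results to stalled() lists the pids that did no work during the interval. A vmware-vmx pid showing up there while the guest should be busy matches the symptom described above.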

       

      Looks to me that vmware-serverd has stopped processing interrupts from the vmware-vmx processes (or something like that).

       

      Anyone have any clues what might be the cause?  Is it my ethernet issues?

      I will move my switch to a UPS later today if I have time...  Just to help debug this.

       

       

       

      George

        • 1. Re: Vmware-serverd unresponsive and vm's stop running
          IdeCable Novice

          Hi,

           

          I may assume you have more experience with Unix/Linux than me. But still: I'll give it a shot.. hehe.

           

          I remember my vmware-serverd was unable to run once because the kernel modules for my running kernel were not compiled properly.

           

          Could you post the output of vmware-config.pl? Or just confirm that the kernel modules are properly compiled and active?? If you restart your vmware service from /etc/init.d, it should tell you whether the modules are loaded or not..

           

          Also, what's your kernel version?? I read two posts saying that 2.6.20 and up creates stability problems :S
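A rough sketch of the module check suggested above, assuming the VMware Server host modules are named vmmon and vmnet (these names are an assumption here, since the thread does not list them) and that the kernel exposes loaded modules in /proc/modules. The helper names are made up for illustration.

```python
import platform


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text, expected=("vmmon", "vmnet")):
    """Return the expected module names that are not currently loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in expected if name not in loaded]


def report():
    """Print the running kernel release and any missing modules."""
    with open("/proc/modules") as handle:
        missing = missing_modules(handle.read())
    print("kernel release:", platform.release())
    print("missing modules:", ", ".join(missing) if missing else "none")
```

Running report() on the host prints the kernel release next to the module status, which answers both questions in one go.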

          • 2. Re: Vmware-serverd unresponsive and vm's stop running
            IdeCable Novice

            Ohh and by the way,

             

            I had a crazy time trying to deal with a Gentoo based AMD64 host OS. I had many system crashes / freezes when running my virtual machine with VMware Server. I'm down to an i686 kernel with SMP and ACPI enabled on an AMD64 X2 cpu (Debian based). It's been very stable ever since.

             

            Is your Xeon system EM64T enabled? What's the arch of your kernel?

             

            A post of your dmesg could be useful as well to examine this situation...

             

            How many different kernels have you tried?
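The architecture questions above can be answered from the host itself. A small sketch, assuming an x86 Linux host: the "lm" (long mode) flag in /proc/cpuinfo indicates a 64-bit capable CPU, and platform.machine() reports the architecture the running kernel was built for. The function names are illustrative only.

```python
import platform


def cpu_supports_64bit(cpuinfo_text):
    """Return True if the first 'flags' line in cpuinfo text lists 'lm'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Compare whole tokens so that e.g. 'clflush' does not match.
            return "lm" in value.split()
    return False


def describe_host():
    """Print CPU 64-bit capability and the running kernel architecture."""
    with open("/proc/cpuinfo") as handle:
        capable = cpu_supports_64bit(handle.read())
    print("CPU 64-bit capable:", capable)
    print("kernel architecture:", platform.machine())
```

A 64-bit capable CPU paired with an i686 kernel architecture would correspond to the 32-bit-kernel-on-64-bit-hardware setup described in this reply.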

            • 3. Re: Vmware-serverd unresponsive and vm's stop running
              IdeCable Novice

              Another post suggests that 2.6.18 is the most stable kernel so far.

              • 4. Re: Vmware-serverd unresponsive and vm's stop running
                dotcom Novice

                Hey there!

                 

                Sorry to take so long to answer, lightning struck one of my radio towers, and made a mess of things... we have been getting so many electrical storms and other issues that the power is lost to my office twice a day on average, and it hit the end of the UPS battery twice...

                 

                I'm thinking that I need to reinstall the server, as it has crashed so many times that it finally came up with a "run fsck by hand" prompt...  I also saw several hundred warnings from the drive controller regarding repaired sectors...  Perhaps the nagios/cacti vm is getting stuck behind disk errors.

                 

                I usually run the daemon that monitors the disk subsystem, but I have not gotten around to installing it...  I may have a corrupted filesystem.  The drive controller has its own battery, so that during a power loss it can save the write cache and re-run the last writes during the next boot... but at least once power may have been off too long and the write buffer lost.

                 

                The system has 12 SATA drives, and while the cpu survives the UPS switching events just fine, it is possible that the drive power supply does not.  So perhaps the vm's are getting stuck behind io errors caused by the drives getting into a bad state...  Yeah, just now the nagios/cacti vm failed to start due to a file system error...  The error could be caused by the fsck, as a result of forced reboot, or it could be a symptom of something else...  My power situation is too unstable to trust.
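As a stopgap until a proper monitoring daemon is installed, the controller messages can be pulled out of the kernel log. A minimal sketch, assuming the 3w-9xxx driver keeps logging its events in the "AEN: SEVERITY (code): message" form seen in the dmesg output in this thread; the pattern is derived from those lines only and may not cover every message the driver can emit.

```python
import re
from collections import Counter

# Matches lines such as:
#   3w-9xxx: scsi0: AEN: WARNING (0x04:0x0006): Incomplete unit detected:unit=1.
AEN_PATTERN = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): AEN: (?P<severity>[A-Z]+) "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): (?P<message>.*)"
)


def parse_aen_events(dmesg_text):
    """Return a list of dicts, one per 3ware AEN line found in the text."""
    events = []
    for line in dmesg_text.splitlines():
        match = AEN_PATTERN.search(line)
        if match:
            events.append(match.groupdict())
    return events


def count_by_severity(events):
    """Count parsed events per severity label (INFO, WARNING, ...)."""
    return Counter(event["severity"] for event in events)
```

Feeding the saved output of dmesg into parse_aen_events() and then count_by_severity() gives a quick tally of how many controller warnings have accumulated since boot.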

                 

                I'm including a dmesg anyway, but I think the real solution is to rebuild the machine, and get it on good power... THEN ask for help if I really need it.

                 

                I started with kernel 2.6.15-26, and I updated to 2.6.15-28.  When I installed vmware server, it compiled the module for the kernel successfully...

                 

                I only mention the 20 years of experience to get past some of the obvious issues...  But after seeing dmesg reporting all of the drive issues, I'm thinking that my experience isn't doing me justice if I fail to observe the logs...  I have been too busy repairing things to do this one right... I probably need to start over and get on good power before I bother people with silly problems...

                 

                Below is a dmesg from a clean boot.

                 

                $ dmesg

                \[42949372.960000] Linux version 2.6.15-28-server (buildd@terranova) (gcc version 4.0.3 (Ubuntu 4.0.3-1ubuntu5)) #1 SMP Thu May 10 10:40:27 UTC 2007

                \[42949372.960000] BIOS-provided physical RAM map:

                \[42949372.960000]  BIOS-e820: 0000000000000000 - 000000000009f000 (usable)

                \[42949372.960000]  BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)

                \[42949372.960000]  BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)

                \[42949372.960000]  BIOS-e820: 0000000000100000 - 00000000efff0000 (usable)

                \[42949372.960000]  BIOS-e820: 00000000efff0000 - 00000000effff000 (ACPI data)

                \[42949372.960000]  BIOS-e820: 00000000effff000 - 00000000f0000000 (ACPI NVS)

                \[42949372.960000]  BIOS-e820: 00000000fec00000 - 00000000fed00000 (reserved)

                \[42949372.960000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)

                \[42949372.960000]  BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)

                \[42949372.960000]  BIOS-e820: 0000000100000000 - 0000000130000000 (usable)

                \[42949372.960000] 3968MB HIGHMEM available.

                \[42949372.960000] 896MB LOWMEM available.

                \[42949372.960000] found SMP MP-table at 000ff780

                \[42949372.960000] On node 0 totalpages: 1245184

                \[42949372.960000]   DMA zone: 4096 pages, LIFO batch:0

                \[42949372.960000]   DMA32 zone: 0 pages, LIFO batch:0

                \[42949372.960000]   Normal zone: 225280 pages, LIFO batch:31

                \[42949372.960000]   HighMem zone: 1015808 pages, LIFO batch:31

                \[42949372.960000] DMI 2.3 present.

                \[42949372.960000] ACPI: RSDP (v000 INTEL                                 ) @ 0x000ff9b0

                \[42949372.960000] ACPI: RSDT (v001 INTEL  S7501HG0 0x00000001 MSFT 0x01000000) @ 0xefff0000

                \[42949372.960000] ACPI: FADT (v001 INTEL  S7501HG0 0x00000001 MSFT 0x01000000) @ 0xefff0030

                \[42949372.960000] ACPI: MADT (v001 INTEL  S7501HG0 0x00000001 MSFT 0x01000000) @ 0xefff00b0

                \[42949372.960000] ACPI: OEMR (v001 INTEL  S7501HG0 0x00000001 MSFT 0x01000000) @ 0xefff0140

                \[42949372.960000] ACPI: DSDT (v001  INTEL S7501HG0 0x00000100 INTL 0x20020918) @ 0x00000000

                \[42949372.960000] ACPI: PM-Timer IO Port: 0x408

                \[42949372.960000] ACPI: Local APIC address 0xfee00000

                \[42949372.960000] ACPI: LAPIC (acpi_id\[0x00] lapic_id\[0x00] enabled)

                \[42949372.960000] Processor #0 15:2 APIC version 20

                \[42949372.960000] ACPI: LAPIC (acpi_id\[0x01] lapic_id\[0x06] enabled)

                \[42949372.960000] Processor #6 15:2 APIC version 20

                \[42949372.960000] ACPI: LAPIC (acpi_id\[0x02] lapic_id\[0x00] disabled)

                \[42949372.960000] ACPI: LAPIC (acpi_id\[0x03] lapic_id\[0x00] disabled)

                \[42949372.960000] ACPI: LAPIC_NMI (acpi_id\[0x00] high level lint\[0x1])

                \[42949372.960000] ACPI: LAPIC_NMI (acpi_id\[0x01] high level lint\[0x1])

                \[42949372.960000] ACPI: IOAPIC (id\[0x08] address\[0xfec00000] gsi_base[0])

                \[42949372.960000] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23

                \[42949372.960000] ACPI: IOAPIC (id\[0x09] address\[0xfec81000] gsi_base\[24])

                \[42949372.960000] IOAPIC[1]: apic_id 9, version 32, address 0xfec81000, GSI 24-47

                \[42949372.960000] ACPI: IOAPIC (id\[0x0a] address\[0xfec81400] gsi_base\[48])

                \[42949372.960000] IOAPIC[2]: apic_id 10, version 32, address 0xfec81400, GSI 48-71

                \[42949372.960000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)

                \[42949372.960000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)

                \[42949372.960000] ACPI: IRQ0 used by override.

                \[42949372.960000] ACPI: IRQ2 used by override.

                \[42949372.960000] ACPI: IRQ9 used by override.

                \[42949372.960000] Enabling APIC mode:  Flat.  Using 3 I/O APICs

                \[42949372.960000] Using ACPI (MADT) for SMP configuration information

                \[42949372.960000] Allocating PCI resources starting at f1000000 (gap: f0000000:0ec00000)

                \[42949372.960000] Built 1 zonelists

                \[42949372.960000] Kernel command line: root=/dev/sda1 ro quiet splash

                \[42949372.960000] mapped APIC to ffffd000 (fee00000)

                \[42949372.960000] mapped IOAPIC to ffffc000 (fec00000)

                \[42949372.960000] mapped IOAPIC to ffffb000 (fec81000)

                \[42949372.960000] mapped IOAPIC to ffffa000 (fec81400)

                \[42949372.960000] Initializing CPU#0

                \[42949372.960000] PID hash table entries: 4096 (order: 12, 65536 bytes)

                \[42949372.960000] Detected 2791.047 MHz processor.

                \[42949372.960000] Using pmtmr for high-res timesource

                \[42949372.960000] Console: colour VGA+ 80x25

                \[42949375.460000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)

                \[42949375.460000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)

                \[42949375.800000] Memory: 4666884k/4980736k available (2087k kernel code, 50344k reserved, 620k data, 332k init, 3801024k highmem)

                \[42949375.800000] Checking if this processor honours the WP bit even in supervisor mode... Ok.

                \[42949375.950000] Calibrating delay using timer specific routine.. 5586.64 BogoMIPS (lpj=27933244)

                \[42949375.950000] Security Framework v1.0.0 initialized

                \[42949375.950000] SELinux:  Disabled at boot.

                \[42949375.950000] Mount-cache hash table entries: 512

                \[42949375.950000] CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000 00004400 00000000 00000000

                \[42949375.950000] CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400 00000000 00000000

                \[42949375.950000] CPU: Trace cache: 12K uops, L1 D cache: 8K

                \[42949375.950000] CPU: L2 cache: 512K

                \[42949375.950000] CPU: Hyper-Threading is disabled

                \[42949375.950000] CPU: After all inits, caps: bfebfbff 00000000 00000000 00000080 00004400 00000000 00000000

                \[42949375.950000] mtrr: v2.0 (20020519)

                \[42949375.950000] Enabling fast FPU save and restore... done.

                \[42949375.950000] Enabling unmasked SIMD FPU exception support... done.

                \[42949375.950000] Checking 'hlt' instruction... OK.

                \[42949375.990000] checking if image is initramfs... it is

                \[42949376.450000] Freeing initrd memory: 6789k freed

                \[42949376.460000] ACPI: Looking for DSDT ... not found!

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[DEB_] had invalid type (Integer) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[MLIB] had invalid type (Integer) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[DATA] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[SIO_] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[LEDP] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[GPEN] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[GPST] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[GP1N] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[WUES] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[WUSE] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[SBID] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[SWCE] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000]     ACPI-0654: *** Warning: Type override - \[SMIR] had invalid type (String) for Scope operator, changed to (Scope)

                \[42949376.460000] CPU0: Intel(R) Xeon(TM) CPU 2.80GHz stepping 05

                \[42949376.460000] Booting processor 1/6 eip 3000

                \[42949376.470000] Initializing CPU#1

                \[42949376.620000] Calibrating delay using timer specific routine.. 5582.28 BogoMIPS (lpj=27911427)

                \[42949376.620000] CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000 00004400 00000000 00000000

                \[42949376.620000] CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400 00000000 00000000

                \[42949376.620000] CPU: Trace cache: 12K uops, L1 D cache: 8K

                \[42949376.620000] CPU: L2 cache: 512K

                \[42949376.620000] CPU: Hyper-Threading is disabled

                \[42949376.620000] CPU: After all inits, caps: bfebfbff 00000000 00000000 00000080 00004400 00000000 00000000

                \[42949376.620000] CPU1: Intel(R) Xeon(TM) CPU 2.80GHz stepping 05

                \[42949376.620000] Total of 2 processors activated (11168.93 BogoMIPS).

                \[42949376.620000] ENABLING IO-APIC IRQs

                \[42949376.620000] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1

                \[42949376.830000] checking TSC synchronization across 2 CPUs: passed.

                \[42949376.840000] Brought up 2 CPUs

                \[42949376.840000] NET: Registered protocol family 16

                \[42949376.840000] EISA bus registered

                \[42949376.840000] ACPI: bus type pci registered

                \[42949376.840000] PCI: PCI BIOS revision 2.10 entry at 0xfdb55, last bus=4

                \[42949376.840000] PCI: Using configuration type 1

                \[42949376.840000] ACPI: Subsystem revision 20051216

                \[42949376.840000] ACPI: Interpreter enabled

                \[42949376.840000] ACPI: Using IOAPIC for interrupt routing

                \[42949376.840000] ACPI: PCI Root Bridge \[PCI0] (0000:00)

                \[42949376.840000] PCI: Probing PCI hardware (bus 00)

                \[42949376.840000] PCI quirk: region 0400-047f claimed by ICH4 ACPI/GPIO/TCO

                \[42949376.840000] PCI quirk: region 0480-04bf claimed by ICH4 GPIO

                \[42949376.840000] PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.1

                \[42949376.840000] Boot video device is 0000:01:0c.0

                \[42949376.840000] PCI: Transparent bridge - 0000:00:1e.0

                \[42949376.840000] ACPI: PCI Interrupt Routing Table \[\_SB_.PCI0._PRT]

                \[42949376.850000] ACPI: Embedded Controller \[EC0] (gpe 8) interrupt mode.

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKB] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKC] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKD] (IRQs 3 4 *5 6 7 9 10 11 12 14 15)

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKE] (IRQs 11 12 14 15) *0, disabled.

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKF] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKG] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)

                \[42949376.850000] ACPI: PCI Interrupt Link \[LNKH] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)

                \[42949376.850000] ACPI: PCI Interrupt Routing Table \[\_SB_.PCI0.P0P1._PRT]

                \[42949376.850000] ACPI: PCI Interrupt Routing Table \[\_SB_.PCI0.P0P5.P5P6._PRT]

                \[42949376.850000] ACPI: PCI Interrupt Routing Table \[\_SB_.PCI0.P0P5.P5P7._PRT]

                \[42949376.850000] Linux Plug and Play Support v0.97 (c) Adam Belay

                \[42949376.850000] pnp: PnP ACPI init

                \[42949376.860000] pnp: PnP ACPI: found 11 devices

                \[42949376.860000] PnPBIOS: Disabled by ACPI PNP

                \[42949376.860000] PCI: Using ACPI for IRQ routing

                \[42949376.860000] PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report

                \[42949376.870000] pnp: 00:00: ioport range 0x3f0-0x3f1 has been reserved

                \[42949376.870000] pnp: 00:00: ioport range 0x400-0x4bf could not be reserved

                \[42949376.870000] pnp: 00:00: ioport range 0x4d0-0x4d1 has been reserved

                \[42949376.870000] pnp: 00:00: ioport range 0x40b-0x40b could not be reserved

                \[42949376.870000] pnp: 00:00: ioport range 0x4d6-0x4d6 has been reserved

                \[42949376.870000] PCI: Bridge: 0000:02:1d.0

                \[42949376.870000]   IO window: 2000-2fff

                \[42949376.870000]   MEM window: fea00000-feafffff

                \[42949376.870000]   PREFETCH window: fb900000-fc8fffff

                \[42949376.870000] PCI: Bridge: 0000:02:1f.0

                \[42949376.870000]   IO window: disabled.

                \[42949376.870000]   MEM window: disabled.

                \[42949376.870000]   PREFETCH window: disabled.

                \[42949376.870000] PCI: Bridge: 0000:00:03.0

                \[42949376.870000]   IO window: 2000-2fff

                \[42949376.870000]   MEM window: fea00000-febfffff

                \[42949376.870000]   PREFETCH window: fb900000-fc8fffff

                \[42949376.870000] PCI: Bridge: 0000:00:1e.0

                \[42949376.870000]   IO window: 1000-1fff

                \[42949376.870000]   MEM window: fc900000-fe9fffff

                \[42949376.870000]   PREFETCH window: f1000000-f10fffff

                \[42949376.870000] PCI: Setting latency timer of device 0000:00:1e.0 to 64

                \[42949376.870000] audit: initializing netlink socket (disabled)

                \[42949376.870000] audit(1180758012.860:1): initialized

                \[42949376.870000] highmem bounce pool size: 64 pages

                \[42949376.870000] VFS: Disk quotas dquot_6.5.1

                \[42949376.870000] Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)

                \[42949376.870000] Initializing Cryptographic API

                \[42949376.870000] io scheduler noop registered

                \[42949376.870000] io scheduler anticipatory registered

                \[42949376.870000] io scheduler deadline registered

                \[42949376.870000] io scheduler cfq registered

                \[42949376.870000] isapnp: Scanning for PnP cards...

                \[42949377.230000] isapnp: No Plug & Play device found

                \[42949377.250000] Real Time Clock Driver v1.12

                \[42949377.250000] PNP: PS/2 Controller \[PNP0303:PS2K] at 0x60,0x64 irq 1

                \[42949377.250000] PNP: PS/2 controller doesn't have AUX irq; using default 12

                \[42949377.250000] serio: i8042 AUX port at 0x60,0x64 irq 12

                \[42949377.250000] serio: i8042 KBD port at 0x60,0x64 irq 1

                \[42949377.250000] Serial: 8250/16550 driver $Revision: 1.90 $ 48 ports, IRQ sharing enabled

                \[42949377.250000] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A

                \[42949377.250000] serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A

                \[42949377.250000] 00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A

                \[42949377.250000] 00:08: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A

                \[42949377.250000] RAMDISK driver initialized: 16 RAM disks of 65536K size 1024 blocksize

                \[42949377.250000] Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2

                \[42949377.250000] ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

                \[42949377.250000] mice: PS/2 mouse device common for all mice

                \[42949377.250000] EISA: Probing bus 0 at eisa.0

                \[42949377.250000] Cannot allocate resource for EISA slot 1

                \[42949377.250000] Cannot allocate resource for EISA slot 2

                \[42949377.250000] Cannot allocate resource for EISA slot 3

                \[42949377.250000] EISA: Detected 0 cards.

                \[42949377.250000] NET: Registered protocol family 2

                \[42949377.370000] IP route cache hash table entries: 262144 (order: 8, 1048576 bytes)

                \[42949377.370000] TCP established hash table entries: 524288 (order: 10, 4194304 bytes)

                \[42949377.370000] TCP bind hash table entries: 65536 (order: 7, 524288 bytes)

                \[42949377.370000] TCP: Hash tables configured (established 524288 bind 65536)

                \[42949377.370000] TCP reno registered

                \[42949377.370000] TCP bic registered

                \[42949377.370000] NET: Registered protocol family 1

                \[42949377.370000] NET: Registered protocol family 8

                \[42949377.370000] NET: Registered protocol family 20

                \[42949377.370000] Starting balanced_irq

                \[42949377.370000] Using IPI No-Shortcut mode

                \[42949377.370000] ACPI wakeup devices:

                \[42949377.370000] PS2K UAR1 UAR2 USB1 USB2 USB3 SMB0 P0P1 P5P6 P5P7

                \[42949377.370000] ACPI: (supports S0 S1 S4 S5)

                \[42949377.370000] Freeing unused kernel memory: 332k freed

                \[42949377.420000] vga16fb: initializing

                \[42949377.420000] vga16fb: mapped to 0xc00a0000

                \[42949377.440000] input: AT Translated Set 2 keyboard as /class/input/input0

                \[42949377.580000] Console: switching to colour frame buffer device 80x25

                \[42949377.580000] fb0: VGA16 VGA frame buffer device

                \[42949377.590000] Capability LSM initialized

                \[42949378.230000] SCSI subsystem initialized

                \[42949378.230000] 3ware 9000 Storage Controller device driver for Linux v2.26.02.004.

                \[42949378.230000] ACPI: PCI Interrupt 0000:04:02.0[A] -> GSI 52 (level, low) -> IRQ 169

                \[42949380.210000] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue:.

                \[42949380.330000] 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0006): Incomplete unit detected:unit=1.

                \[42949380.450000] scsi0 : 3ware 9000 Storage Controller

                \[42949380.450000] 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfeaf0000, IRQ: 169.

                \[42949380.810000] 3w-9xxx: scsi0: Firmware FE9X 2.04.00.005, BIOS BE9X 2.03.01.047, Ports: 8.

                \[42949380.810000]   Vendor: 3ware     Model: Logical Disk 00   Rev: 1.00

                \[42949380.810000]   Type:   Direct-Access                      ANSI SCSI revision: 00

                \[42949380.830000] Driver 'sd' needs updating - please use bus_type methods

                \[42949380.830000] SCSI device sda: 1249910784 512-byte hdwr sectors (639954 MB)

                \[42949380.830000] SCSI device sda: drive cache: none

                \[42949380.830000] SCSI device sda: 1249910784 512-byte hdwr sectors (639954 MB)

                \[42949380.830000] SCSI device sda: drive cache: none

                \[42949380.830000]  sda: sda1 sda2 < sda5 >

                \[42949380.860000] sd 0:0:0:0: Attached scsi disk sda

                \[42949381.050000] ICH3: IDE controller at PCI slot 0000:00:1f.1

                \[42949381.050000] PCI: Enabling device 0000:00:1f.1 (0005 -> 0007)

                \[42949381.050000] ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 16 (level, low) -> IRQ 177

                \[42949381.050000] ICH3: chipset revision 2

                \[42949381.050000] ICH3: not 100% native mode: will probe irqs later

                \[42949381.050000]     ide0: BM-DMA at 0x03a0-0x03a7, BIOS settings: hda:DMA, hdb:pio

                \[42949381.050000]     ide1: BM-DMA at 0x03a8-0x03af, BIOS settings: hdc:pio, hdd:pio

                \[42949381.050000] Probing IDE interface ide0...

                \[42949381.830000] hda: SAMSUNG CD-ROM SN-124, ATAPI CD/DVD-ROM drive

                \[42949382.550000] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14

                \[42949382.550000] Probing IDE interface ide1...

                \[42949383.160000] hda: ATAPI 24X CD-ROM drive, 128kB Cache, UDMA(33)

                \[42949383.160000] Uniform CD-ROM driver Revision: 3.20

                \[42949383.230000] usbcore: registered new driver usbfs

                \[42949383.230000] usbcore: registered new driver hub

                \[42949383.230000] USB Universal Host Controller Interface driver v2.3

                \[42949383.230000] ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 177

                \[42949383.230000] PCI: Setting latency timer of device 0000:00:1d.0 to 64

                \[42949383.230000] uhci_hcd 0000:00:1d.0: UHCI Host Controller

                \[42949383.230000] uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1

                \[42949383.230000] uhci_hcd 0000:00:1d.0: irq 177, io base 0x00003040

                \[42949383.230000] hub 1-0:1.0: USB hub found

                \[42949383.230000] hub 1-0:1.0: 2 ports detected

                \[42949383.340000] ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 185

                \[42949383.340000] PCI: Setting latency timer of device 0000:00:1d.1 to 64

                \[42949383.340000] uhci_hcd 0000:00:1d.1: UHCI Host Controller

                \[42949383.340000] uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2

                \[42949383.340000] uhci_hcd 0000:00:1d.1: irq 185, io base 0x00003020

                \[42949383.340000] hub 2-0:1.0: USB hub found

                \[42949383.340000] hub 2-0:1.0: 2 ports detected

                \[42949383.450000] ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 193

                \[42949383.450000] PCI: Setting latency timer of device 0000:00:1d.2 to 64

                \[42949383.450000] uhci_hcd 0000:00:1d.2: UHCI Host Controller

                \[42949383.450000] uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3

                \[42949383.450000] uhci_hcd 0000:00:1d.2: irq 193, io base 0x00003000

                \[42949383.450000] hub 3-0:1.0: USB hub found

                \[42949383.450000] hub 3-0:1.0: 2 ports detected

                \[42949383.590000] Probing IDE interface ide1...

                \[42949384.220000] Attempting manual resume

                \[42949384.260000] EXT3-fs: INFO: recovery required on readonly filesystem.

                \[42949384.260000] EXT3-fs: write access will be enabled during recovery.

                \[42949387.310000] kjournald starting.  Commit interval 5 seconds

                \[42949387.310000] EXT3-fs: sda1: orphan cleanup on readonly fs

                \[42949387.320000] ext3_orphan_cleanup: deleting unreferenced inode 55721994

                \[42949387.320000] EXT3-fs: sda1: 1 orphan inode deleted

                \[42949387.320000] EXT3-fs: recovery complete.

                \[42949387.450000] EXT3-fs: mounted filesystem with ordered data mode.

                \[42949389.730000] sd 0:0:0:0: Attached scsi generic sg0 type 0

                \[42949390.420000] input: PC Speaker as /class/input/input1

                \[42949390.470000] Floppy drive(s): fd0 is 1.44M

                \[42949390.490000] FDC 0 is a National Semiconductor PC87306

                \[42949390.570000] pci_hotplug: PCI Hot Plug PCI Core version: 0.5

                \[42949390.580000] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4

                \[42949390.730000] parport: PnPBIOS parport detected.

                \[42949390.730000] parport0: PC-style at 0x378 (0x778), irq 7, dma 3 \[PCSPP,TRISTATE,COMPAT,ECP,DMA]

                \[42949390.770000] hw_random hardware driver 1.0.0 loaded

                \[42949390.830000] Intel(R) PRO/1000 Network Driver - version 7.0.33-k2

                \[42949390.830000] Copyright (c) 1999-2005 Intel Corporation.

                \[42949390.830000] ACPI: PCI Interrupt 0000:04:05.0[A] -> GSI 58 (level, low) -> IRQ 201

                \[42949391.120000] e1000: 0000:04:05.0: e1000_probe: (PCI:66MHz:64-bit) 00:07:e9:40:14:c4

                \[42949391.160000] e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection

                \[42949391.160000] ACPI: PCI Interrupt 0000:04:05.1[B] -> GSI 59 (level, low) -> IRQ 209

                \[42949391.460000] e1000: 0000:04:05.1: e1000_probe: (PCI:66MHz:64-bit) 00:07:e9:40:14:c5

                \[42949391.500000] e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection

                \[42949391.880000] lp0: using parport0 (interrupt-driven).

                \[42949391.990000] Adding 11406108k swap on /dev/sda5.  Priority:-1 extents:1 across:11406108k

                \[42949392.140000] EXT3 FS on sda1, internal journal

                \[42949392.370000] md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27

                \[42949392.370000] md: bitmap version 4.39

                \[42949393.110000] device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel@redhat.com

                \[42949393.190000] e1000: eth1: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex

                \[42949393.930000] cdrom: open failed.

                \[42949394.200000] NET: Registered protocol family 10

                \[42949394.200000] lo: Disabled Privacy Extensions

                \[42949394.210000] IPv6 over IPv4 tunneling driver

                \[42949397.780000] vmmon: module license 'unspecified' taints kernel.

                \[42949397.780000] /dev/vmmon\[3715]: Module vmmon: registered with major=10 minor=165

                \[42949397.780000] /dev/vmmon\[3715]: Module vmmon: initialized

                \[42949398.220000] /dev/vmnet: open called by PID 3743 (vmnet-bridge)

                \[42949398.220000] /dev/vmnet: hub 0 does not exist, allocating memory.

                \[42949398.220000] /dev/vmnet: port on hub 0 successfully opened

                \[42949398.220000] bridge-eth1: enabling the bridge

                \[42949398.220000] bridge-eth1: up

                \[42949398.220000] bridge-eth1: already up

                \[42949398.220000] bridge-eth1: attached

                \[42949398.230000] /dev/vmnet: open called by PID 3748 (vmnet-bridge)

                \[42949398.230000] /dev/vmnet: hub 2 does not exist, allocating memory.

                \[42949398.230000] /dev/vmnet: port on hub 2 successfully opened

                \[42949398.230000] bridge-eth0: peer interface eth0 not found, will wait for it to come up

                \[42949398.230000] bridge-eth0: attached

                \[42949402.600000] /dev/vmnet: open called by PID 3824 (vmware-vmx)

                \[42949402.600000] device eth1 entered promiscuous mode

                \[42949402.600000] bridge-eth1: enabled promiscuous mode

                \[42949402.600000] /dev/vmnet: port on hub 0 successfully opened

                \[42949402.670000] /dev/vmmon\[3836]: host clock rate change request 0 -> 19

                \[42949404.390000] eth1: no IPv6 routers present

                \[42949410.010000] /dev/vmmon\[3836]: host clock rate change request 19 -> 100

                \[42949411.680000] /dev/vmmon\[3836]: host clock rate change request 100 -> 200

                \[42949424.850000] /dev/vmmon\[3836]: host clock rate change request 200 -> 201

                \[42949440.850000] /dev/vmmon\[3836]: host clock rate change request 201 -> 200

                \[42949582.160000] vmmon: Had to deallocate AWE 16 pages from vm driver f71ed000

                • 5. Re: Vmware-serverd unresponsive and vm's stop running
                  IdeCable Novice

                  Hi again,

                   

                  Yeah we're having lots of storms tonight as well (working overnight to finalise my VMware Server lol).

                   

                  It is a fact that \*12* drives on a UPS take a lot of power. I suppose your UPS software has some sort of utility to monitor how many watts your whole system is drawing. Over here we're drawing about 60% of the total load the UPS can usually take (according to its utility software lol). I don't have any servers with more than one power supply.

                   

                  Suggestion: how about adding another small UPS just for the power supply that feeds your hard drives, and plugging that small UPS into your existing UPS circuit? That way your hard drives would sit behind a total of two UPSes (in series)..

                   

                  If it is less time consuming for you to re-install your OS, then I cannot say it would be a bad thing to do. But then again: this is an open-source OS. If your OS partitions are healthy, then a reinstall should not be needed.. My point is that open-source OSes \*don't* have any dark secrets like the closed-source ones, where it (sometimes) comes to a point where the \*only* solution is to reinstall.

                   

                  Every Linux server I use has ext2 for the boot partition and ReiserFS everywhere else. Your dmesg shows EXT3-fs on sda1 though, so your Ubuntu box is on ext3..

                   

                  I had to run a fsck by hand on a Gentoo 2006.1 box one day.. I had to rebuild the whole tree (only lost one file..). I was lucky it was not anything more than that..

                   

                  If physical damage to your HDs is not part of this whole thing, then I \*assume* it should be possible for you to recover your system without having to re-create everything from scratch.

                   

                  I highly recommend performing the fsck from a live CD of some sort, like the Gentoo 2006.1 minimal ISO for example. Boot CD or not, what matters is that you can perform a full fsck without interruptions of any sort.

                   

                  I suppose you have a RAID setup, right? I know that software arrays in Linux (md devices) have some data scrubbing utilities.. That could help, who knows.

                   

                  I hope you'll find a good way to patch/fix your UPS issue.. But what do you do with your system when power failures happen? Can you afford to put your server into hibernation during a long power failure? The way you're explaining your situation, it sounds like you're trying as much as possible to keep it running all the time. My setup here is meant to hibernate if the power does not get back to normal after 10 or 15 mins.

                   

                  \*If* you cannot afford downtime at all, then a stronger UPS setup may be needed :(. Upgrading your UPS setup will definitely help prevent further hard-drives-going-crazy adventures.

                   

                  Here are two links: one for the Gentoo minimal ISO and one for the RAID howto from the Gentoo Wiki (to re-mount RAID setups by hand..). The Gentoo live CD is one of the boot CDs I use the most.

                   

                  http://mirror.switch.ch/ftp/mirror/gentoo/releases/x86/2007.0/installcd/install-x86-minimal-2007.0-r1.iso

                   

                  http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID

                   

                  I'm 95% sure you won't need these two links. I'm still pasting them here because a live CD is the only way I've been happy recovering filesystems with sector failures (without being annoyed by the other services a distro can have already loaded in memory while \*trying* to fix a filesystem..)

                   

                  Nice dmesg by the way, Xeon rocks!

                   

                  Good luck with your filesystem recovery.

                  • 6. Re: Vmware-serverd unresponsive and vm's stop running
                    xenzen Lurker

                    Dotcom,

                     

                    I have experienced the same problem, "Could not create lock for vmware-serverd". This is due to the filesystem where the lockfile is stored (usually "/var") running out of disk space.

                     

                    Do a "df -h", look for the partition where "/var" is located, and check its status. You may find that "/var" is mounted on its own partition, in which case it will be obvious. If not, it is included in the "/" root partition.
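                    The df check described above can be sketched as a couple of small shell functions. This is only an illustration, not anything VMware ships: the function names (usage_percent, check_usage) and the 95% threshold are my own assumptions. It relies on "df -P", the POSIX output format, which prints one line per filesystem with the capacity in the fifth column.

```sh
#!/bin/sh
# Hypothetical helper: print the used percentage (digits only) of the
# filesystem that holds the given path. Works whether or not the path
# is its own mount point, because df resolves it to the right filesystem.
usage_percent() {
    # Line 1 of "df -P" is the header; line 2 is the filesystem entry.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: compare a used percentage against a threshold.
# $1 = used percent, $2 = threshold percent
check_usage() {
    if [ "$1" -ge "$2" ]; then
        echo "FULL"
    else
        echo "OK"
    fi
}

# Report on one path; 95 is an assumed threshold, adjust to taste.
report() {
    pct=$(usage_percent "$1")
    if [ -z "$pct" ]; then
        echo "$1: could not read usage from df"
        return 1
    fi
    echo "$1: ${pct}% used -> $(check_usage "$pct" 95)"
}
```

                    After sourcing the file, "report /var" prints the usage of whichever filesystem holds "/var", so it covers both the separate-partition case and the case where "/var" lives on "/".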

                     

                    I try to mount "/var" on its own partition as it is easier to administer. "/var" contains all sorts of changing data (log files, lock files, cache files etc.) and it is essential to keep it under control using a log rotation regime such as logrotate. On boot the system cleans up stale lock files and other such transient stuff, which may account for you being able to temporarily cure your problem by doing a reboot.

                     

                    Hope that helps.

                     

                    Dan