Bolgard
Contributor

The s*** has hit the fan


Hi there,

I'm having some trouble with VMware ESXi. It started when I created a new VM and installed Windows Server 2003 on it. After shutting it down, things started happening:

  • If I go to the Configuration tab, the Health Status section says "Unknown" for all devices

  • I can't use the console for any VM; they respond with "Connection terminated" or, if they're powered off, "A general system error occurred"

So I figured a reboot might solve it. I entered maintenance mode and then clicked "Reboot". Now it seems to be halted somewhere in between: I can't access the "Reboot" and "Shutdown" commands anymore, and I can't exit maintenance mode.

If I look at the logs, in /var/log/messages I get a lot of "init fn user failed with: Out of memory!" and "WorldInit failed: trying to cleanup.".
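For anyone else hitting this, a quick way to see how often those warnings recur is to count them per day; a sketch (the sample file below just stands in for the real /var/log/messages, and the line format is simplified):

```shell
# Build a small stand-in for /var/log/messages, then count the warnings.
cat > /tmp/messages.sample <<'EOF'
Jan 2 05:31:12 host vmkernel: WARNING: Heap: 1397: Heap globalCartel already at its maximumSize. Cannot expand.
Jan 2 05:31:12 host vmkernel: WARNING: World: vm 12172104: 910: init fn user failed with: Out of memory!
Jan 3 09:40:00 host vmkernel: WARNING: Heap: 1397: Heap globalCartel already at its maximumSize. Cannot expand.
EOF

# How many world creations failed outright:
grep -c 'Out of memory' /tmp/messages.sample

# Heap warnings grouped per day (first two fields are the date):
grep 'Heap globalCartel' /tmp/messages.sample | cut -d' ' -f1,2 | uniq -c
```

If the count climbs steadily day over day, that points at a slow leak rather than a one-off failure.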

Anyone have any idea what's going on here?

EDIT: After a hard reboot I'm able to start the VMs again. I don't dare touch much, though. What could be the problem here?

Message was edited by: tom howarth Profanity removed from Subject line

34 Replies
Jackobli
Virtuoso

Tell us something about your hardware.

Bolgard
Contributor

Running on a Fujitsu-Siemens RX100 S4 with an Adaptec 3405 RAID controller, which is supported according to VMware's I/O compatibility list.

I'm guessing it has something to do with the RAID controller even though it's supposed to be supported...? Any logs I could check?

Jackobli
Virtuoso

I'm guessing it has something to do with the RAID controller even though it's supposed to be supported...? Any logs I could check?

Hmm, I don't know the 3405. Are there any logs in the BIOS configuration utility? Any messages saved in the logs viewable through the VI Client?

Are you running RAID 1 or RAID 5? SAS or SATA? Is a BBU installed?

I think problems with the disks/RAID would lead to other errors. Any ECC errors on your system? Bad RAM is always a source of errors.

Bolgard
Contributor

I currently don't have physical access to the server, so I can't check any BIOS logs right now. There are no log messages from before the hard reboot in the VI Client, and I can't see anything abnormal there now.

Currently running two 250 GB SATA disks in RAID 1, no network-attached storage. All VMs reside on the local storage mentioned. If BBU means battery backup unit, it's not installed on the RAID card. But I've disabled write caching, so the data shouldn't get corrupted in case of a power failure.

I guess I would need to run memtest or something similar to check for ECC errors? I'll do that the next time I have physical access to the server.

EDIT: Now that I come to think of it, there were some strange messages related to RAM before the reboot. Something about heap and memory, plus the log messages I mentioned in the first post in this thread.

Jackobli
Virtuoso

I guess I would need to run memtest or something similar to check for ECC errors? I'll do that the next time I have physical access to the server.

EDIT: Now that I come to think of it, there were some strange messages related to RAM before the reboot. Something about heap and memory, plus the log messages I mentioned in the first post in this thread.

memtest should surface any RAM errors. If your server has ECC (which I really suppose it does), any one- or two-bit error in RAM will be detected; one-bit errors should be corrected and are usually logged somewhere in your server's BIOS.
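Slightly off-topic, but for anyone wondering what "one-bit errors should be corrected" means in practice, here's a toy Hamming(7,4) round-trip in Python. Real ECC DIMMs use SECDED over a wider word (the extra overall parity bit is what lets them also *detect* two-bit errors), so this is just the idea, not the actual DIMM circuitry:

```python
# Toy Hamming(7,4): parity bits at positions 1, 2, 4 are chosen so that
# the XOR of the positions of all set bits in a valid codeword is zero.
def encode(d):                      # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]        # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]        # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]        # covers positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]   # positions 1..7

def correct(c):                     # c: 7-bit codeword, possibly corrupted
    c = list(c)
    s = 0                           # syndrome = position of the flipped bit
    for i in range(7):
        if c[i]:
            s ^= i + 1
    if s:                           # non-zero syndrome -> flip it back
        c[s - 1] ^= 1
    return c

word = encode([1, 0, 1, 1])
bad = list(word)
bad[4] ^= 1                         # simulate a single-bit RAM error
assert correct(bad) == word         # the hardware recovers the original word
```

The BIOS log Jackobli mentions is essentially a record of how often that "flip it back" step fired.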

Did you buy that server with its RAM installed? According to the Kingston homepage, it has four banks. For best performance they recommend installing pairs.

Bolgard
Contributor

Well, everything I possibly could have done wrong with the RAM, I have done wrong, now that I think of it.

  • The server came installed with 1 GB RAM.

  • I added 2 GB of RAM (not certified by FS, since that was a lot cheaper, though I get a warning in the BIOS about it. I figured it wouldn't matter that much; I have had several different RAM types in workstations and they have always worked fine)

  • So there's a total of 3 GB of RAM in 3 slots, which means the modules aren't grouped in pairs either.

  • And I'm not sure the added 2 GB is ECC (though I guess it has to be if the motherboard supports ECC?)

So we've found the problem there, don't you think?

Jackobli
Virtuoso

Well, everything I possibly could have done wrong with the RAM, I have done wrong, now that I think of it.

  • I added 2 GB of RAM (not certified by FS, since that was a lot cheaper, though I get a warning in the BIOS about it. I figured it wouldn't matter that much; I have had several different RAM types in workstations and they have always worked fine)

The harder a machine uses its RAM, the sooner you will see errors.

  • So there's a total of 3 GB of RAM in 3 slots, which means the modules aren't grouped in pairs either.

Kingston says it's not recommended (a performance concern), but it should work.

  • And I'm not sure the added 2 GB is ECC (though I guess it has to be if the motherboard supports ECC?)

I'm quite sure the server would at least throw out a BIG warning, but most probably it has to be ECC.

So we've found the problem there, don't you think?

I've seen so many problems caused by RAM that I suppose it is the source. But you never know.

So try with just the 1 GB, or spend lots of bucks on original RAM, or some bucks on compatible RAM. I have (nearly) never had problems with Kingston RAM, and they will at least try to support you. I don't know where you live, but I found 2x2 GB KFJ-E50/4G for US $112.00 at the Kingston shop.

Bolgard
Contributor

Thanks for the help!

charlesleaverdd
Contributor

I have a Dell PowerEdge 6850 with the following specs:

  • 4x dual-core Xeon 7120M, 3.0 GHz

  • 32 GB (16x 2 GB dual-rank DIMMs), 400 MHz

  • 4x 73 GB Ultra320 SCSI 15K 1-inch 80-pin HDDs

  • Embedded RAID

  • ESX 3i 3.5.0 build 123629

There are five virtual machines on it. Its utilisation was never more than half of its total capacity. I had also assigned the system itself 3000 MHz of CPU and 2048 MB of RAM in case it ever got into a situation where it needed that.

Despite that I am also experiencing this problem. The entries in my logs look like this:

Jan 2 05:31:12 196.x.x.x vmkernel: 48:13:18:46.449 cpu13:12172097)WARNING: Heap: 1397: Heap globalCartel already at its maximumSize. Cannot expand.

Jan 2 05:31:12 196.x.x.x vmkernel: 48:13:18:46.449 cpu13:12172097)WARNING: Heap: 1522: Heap_Align(globalCartel, 48/48 bytes, 4 align) failed. caller: 0x73a8ae

Jan 2 05:31:12 196.x.x.x vmkernel: 48:13:18:46.449 cpu13:12172097)WARNING: World: vm 12172104: 910: init fn user failed with: Out of memory!

Jan 2 05:31:12 196.x.x.x vmkernel: 48:13:18:46.449 cpu13:12172097)WARNING: World: vm 12172104: 1775: WorldInit failed: trying to cleanup.
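Side note on reading those lines: if I'm interpreting the "48:13:18:46.449" prefix correctly, it's the vmkernel's uptime as days:hours:minutes:seconds, which is useful for correlating with the "breaks after a few weeks" pattern. A hypothetical little converter (not a VMware tool):

```python
# Convert a vmkernel uptime stamp like "48:13:18:46.449"
# (days:hours:minutes:seconds.ms) into total seconds.
def uptime_seconds(stamp):
    days, hours, mins, secs = stamp.split(":")
    return (int(days) * 86400 + int(hours) * 3600
            + int(mins) * 60 + float(secs))

secs = uptime_seconds("48:13:18:46.449")
print(secs / 86400)   # about 48.6 days of uptime when the heap ran dry
```

So on this box the globalCartel heap was exhausted after roughly seven weeks of uptime, consistent with a slow leak rather than a sudden fault.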

I can't log in to the ESXi host via any of the three ways I have tried: the unsupported console on the actual machine itself, the RCLI, and the Windows VI Client. The connection eventually times out.

The five virtual machines are all still running perfectly, however. Is there a correct way I should be approaching this? I have found absolutely nothing other than this post. If there is nothing else I can do, then I was going to bounce the ESX box, but I am extremely wary of that, as all of those machines are mission-critical boxes that can't go down.

Thanks in advance for any ideas or suggestions.

Cheers, Charles.

Bolgard
Contributor

Hi Charles,

I'm still experiencing this issue and have not solved it yet. Since I've been alone with this problem (as far as I've seen), I believed it to be a hardware fault, and this weekend I'll take the ESXi host down for memtest.

Since I only have 3 GB of RAM, I also had an idea that it might have something to do with resource pools and VMs stealing all the resources from the host. See my post here: http://communities.vmware.com/message/1139953; there are also some screenshots of my log file if you wish to compare.

I would advise not rebooting the host from the Virtual Infrastructure Client, as it will hang the host and you will need to hard-reset it (at least that is what happened to me). After a reboot, the problem reappears after a week or two.

What RAID controller are you using (exact model)? Those 16 dual-rank DIMMs, are they all the same model and manufacturer?

EDIT: I would also advise not shutting down or even rebooting any of the VMs, as you won't be able to start them from the VIC anymore (I'm presupposing you're having the exact same problem as I am; it's possible this doesn't apply to your situation).

charlesleaverdd
Contributor

I cannot do anything from VirtualCenter, as the box is no longer even connected in VirtualCenter, and there is absolutely no way for me to do anything to or with it: it is completely dead to all of the normal mechanisms I would use (VirtualCenter, the RCLI, and even the unsupported console on the physical machine). So I can only shut the virtual machines down by logging on to them and typing halt. The RAID card is the built-in card that comes with that system, a PERC 4i. The RAM is all original and has never been added to or touched in any way, so yes, identical.

I am having the exact same problem as you, I'm pretty sure of that. So I see you have hard-rebooted your box multiple times. Does it always come back fine? Does VirtualCenter get affected in any way? Do those hosts go back from being "Unknown" to what they are really called?

By the way, I suspect one of the virtual machines to be the cause of this, as it is continuously complaining about maxing out its RAM, which I think makes your suspicion of a memory leak likely. But hey, it could be faulty RAM too. It's not terribly easy for me to find out, as these virtual machines cannot go down, so I can't spend ages running memtest on the box. If I can hard-reboot, then I can move them off to my other ESX server and then, sure, I can try memtest.

Bolgard
Contributor

Well, first of all, I'm not running VirtualCenter. I'm running the free version: ESXi 3.5 + the Virtual Infrastructure Client (VIC). I can still connect via the VIC, but I cannot control the VMs. I can view logs, statistics and settings. I can also try to soft-reboot the ESXi host (put it in maintenance mode and reboot from the VIC), but that just hangs the host and I have to hard-reset it. Because I can enter maintenance mode and therefore also shut down the VMs, I guess my hard reboots aren't that painful for the system, but the problem always reappears after a few days or a couple of weeks. Directly after the reboot, all statuses show OK instead of Unknown and everything works as it's supposed to.

Interesting point that it might be a VM causing it! Now that I think of it, one of my Linux VMs isn't running VMware Tools. Maybe that's what's causing it? Where do you see the logs for individual VMs? All I can view are the logs for the ESXi host. This VM was also imported via VMware Converter; if your VMs are also imported and not running VMware Tools, that might be our problem.

The reason I asked about RAID and RAM is that I'm running an Adaptec 3405 RAID card (which is on the I/O compatibility list, but I can't monitor it in the VIC) and mixed brands of RAM modules, and I thought that was causing problems (although the same hardware ran fine in a Windows-only box for more than a year).

charlesleaverdd
Contributor

All of mine are running VMware Tools. Whether I connect using VirtualCenter or directly to the machine via the VI Client, it will not connect. I have tried restarting the management agents.

I was previously able to connect via the RCLI and also using the VI Client or VirtualCenter, but now it has deteriorated past the point of being able to do anything with it at all. Very impressive that the virtual machines are still absolutely 100% fine.

You should be able to find the logs for each machine, if no other way then at least by using the datastore browser and browsing to the directory containing that machine's disk files. There you will find logs for that machine. My VMs weren't imported. A few may have been clones of others, but I don't think that's relevant. The others are clearly running fine, whereas this one always complains. We blame the application that runs on there, but the developer gets very offended when we do. Hopefully he's not reading this right now. :)

Bolgard
Contributor

Don't blame the poor developer! ;) Even if it is a memory leak in his application, ESXi should not expand that VM's RAM beyond its maximum limit anyway. And it should most definitely not lead to this problem with the host.

All my VMs are running perfectly fine as well. How long did it take before you completely lost the connection to the ESXi host? I'm guessing I've been rebooting too frequently to have reached that "state".

I'll check my VMs' logs as soon as I can access a PC with the VIC again (in about an hour or so). If I also find events about memory, that would tell us a specific VM might be causing our problem...

I doubt the problem is a faulty RAM module, as we both have VMs running fine... eventually a faulty RAM area would be used by a VM, and we would have seen some issues at that level as well.

EDIT: I cannot download any logs for the VMs from the server: "failed to connect to nfc server". I guess this is related to the other management issues and the current state of the ESXi host. Is there no other way to view the logs? From what I can see by looking at the statistics for each VM, they all seem fine; memory is not used to its limits. Maybe it isn't a VM that's causing this after all...

Dave_Mishchenko
Immortal

Have you tried running esxtop at the console to see if any process is using too much memory?

Bolgard
Contributor

No, I'll run that and memtest tomorrow when I can physically access the server again, and I'll get back to you then. charlesleaverdd can't, because he can't access the ESXi host at all. Are you guys on any IRC channel I could join?

Dave_Mishchenko
Immortal

I'm not on IRC but you can send me a PM as I would be interested in your results.

Bolgard
Contributor

OK, so I can now physically access the server. I pressed Alt+F1, and the first thing that met me was a screen with the same message repeated:

"/etc/init.d/sfcbd-watchdog: /etc/init.d/sfcbd-watchdog: 309: Cannot fork"

"sh: 0: unknown operand"

I then typed "unsupported" and pressed Enter. The above messages kept popping up continuously. At the login prompt a new message is shown:

"failure forking: Cannot allocate memory"

I entered my password, but I can't seem to get to a shell (and thus cannot enter any commands).

I'll have to do a reboot...

EDIT: Rebooted. Everything is looking fine now. Too bad I couldn't view esxtop while I had the problem. At the unsupported console, messages keep popping up saying "child still alive with a status of 0", followed on the next row by "Killed". I don't know if this is normal. I have also enabled SSH access so I can check settings remotely next time. I'll take it offline for memtest now; we'll see what that tells us.

EDIT2: Memtest has completed a pass now: "Pass complete, no errors, press Esc to exit". So the problem is not with the RAM. I'm starting to believe this is a bug...

Dave_Mishchenko
Immortal

Do you have the Linux version of the RCLI (either the installable version or the appliance)? It includes resxtop; I would suggest starting that up early this week and leaving it running to see what happens with memory. What build of ESXi do you have?
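To leave it running unattended, resxtop's batch mode can log to a CSV for later review. A sketch (the host name is made up, and you should check the flags against your RCLI version's documentation):

```shell
# Sample every 10 seconds for 24 hours (8640 iterations) in batch mode,
# writing CSV that can be replayed in esxtop or opened in a spreadsheet.
resxtop --server esxi-host.example.com -b -d 10 -n 8640 > /tmp/esxtop-mem.csv
```

That way, if the host dies again mid-week, the CSV still shows which world's memory was growing right up to the failure.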
