VMware Cloud Community
Kh3ops
Contributor

ESXi 4 : virtual machines crash

Hi everybody,

I'm having a really annoying issue with VMware ESXi 4. I'm running 8 virtual machines with a fairly normal load (CPU, memory, disk).

Almost every week, all virtual machines suddenly become unavailable (they no longer answer ping). I can still connect to the ESXi host with the vSphere Client, but I cannot reboot the VMs.

The only option is to hard reboot the server (reboot from vSphere does not work either).

ESXi is running on a Dell PowerEdge chassis (dual Xeon, 16 GB RAM).

Any idea where to look?

Regards,

Gaëtan

5 Replies
Jackobli
Virtuoso

Hello and welcome to the VMware ESXi community forum.

I'm having a really annoying issue with VMware ESXi 4. I'm running 8 virtual machines with a fairly normal load (CPU, memory, disk).

Almost every week, all virtual machines suddenly become unavailable (they no longer answer ping). I can still connect to the ESXi host with the vSphere Client, but I cannot reboot the VMs.

ESXi is running on a Dell PowerEdge chassis (dual Xeon, 16 GB RAM).

Any idea where to look?

Everywhere ;)

C'mon, tell us more:

- what kind of virtual machines (OS, one or more vCPUs, how much memory)

- what kind of network (number of NICs, which switch they are connected to, VLANs or not)

- what kind of storage system (local, NFS, SAN, RAID level)

Is this a supported server (on the HCL)?

Have you checked for newer BIOS and firmware versions?

Kh3ops
Contributor

Hi,

I'm running 8 VMs on this server :

1°) Gentoo OS : 1 CPU, 2GB RAM

2°) Gentoo OS : 2 CPU, 2GB RAM

3°) Gentoo OS : 4 CPU, 2GB RAM

4°) FreeBSD : 4 CPU, 2GB RAM

5°) Gentoo OS : 2 CPU, 3GB RAM

6°) Gentoo OS : 2 CPU, 3GB RAM

7°) Gentoo OS : 1 CPU, 2GB RAM

8°) Gentoo OS : 1 CPU, 2GB RAM

Each VM has a single NIC attached to the main network (no VLAN).

Storage: local hard drive with thin provisioning.

This morning the server crashed again, and I was able to get the following logs from syslog:

May 1 00:48:04 10.0.101.8 Hostd: Activation : Invoke done on

May 1 00:48:04 10.0.101.8 Hostd: Arg version:

"48"

May 1 00:48:04 10.0.101.8 Hostd: Throw vmodl.fault.RequestCanceled

May 1 00:48:04 10.0.101.8 Hostd: Result:

(vmodl.fault.RequestCanceled) {

dynamicType = <unset>,

faultCause = (vmodl.MethodFault) null,

msg = "",

}

May 1 00:48:04 10.0.101.8 Hostd: PendingRequest: HTTP Transaction failed, closing connection: N7Vmacore15SystemExceptionE(Connection reset by peer)

May 1 00:48:09 10.0.101.8 Hostd: Ticket issued for CIMOM version 1.0, user root

May 1 00:48:15 10.0.101.8 vmkernel: 10:06:30:59.197 cpu2:6520)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005067900) to NMP device "naa.600508e00000000051c321908952fa08" failed on physical path "vmhba0:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0

This is the last message before I rebooted the server.

Thanks for your help,

Gaëtan

Jackobli
Virtuoso

I'm running 8 VMs on this server :

1°) Gentoo OS : 1 CPU, 2GB RAM

2°) Gentoo OS : 2 CPU, 2GB RAM

3°) Gentoo OS : 4 CPU, 2GB RAM

4°) FreeBSD : 4 CPU, 2GB RAM

5°) Gentoo OS : 2 CPU, 3GB RAM

6°) Gentoo OS : 2 CPU, 3GB RAM

7°) Gentoo OS : 1 CPU, 2GB RAM

8°) Gentoo OS : 1 CPU, 2GB RAM

You have a dual Xeon; assuming each CPU is a quad-core, that gives you 8 cores in total. But the third and fourth VMs alone are already allocated as many vCPUs (4 + 4) as there are physical cores. This configuration is not really recommended. Why so many vCPUs?

You are overcommitting RAM too: there is 16 GB of physical memory, your guests are allocated 18 GB, and the hypervisor needs some as well.
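To put numbers on the two points above, here is a minimal Python sketch that totals the vCPU and RAM allocations from the posted VM list and compares them with the host. The 8-core figure is the same assumption as above (two quad-core Xeons, not confirmed by the poster), and the variable names are only illustrative.

```python
# Guest inventory as listed in the thread: (name, vCPUs, RAM in GB).
guests = [
    ("Gentoo 1", 1, 2),
    ("Gentoo 2", 2, 2),
    ("Gentoo 3", 4, 2),
    ("FreeBSD 4", 4, 2),
    ("Gentoo 5", 2, 3),
    ("Gentoo 6", 2, 3),
    ("Gentoo 7", 1, 2),
    ("Gentoo 8", 1, 2),
]

HOST_RAM_GB = 16  # from the original post
HOST_CORES = 8    # ASSUMPTION: two quad-core Xeons, not confirmed by the poster

# Sum the per-guest allocations.
total_vcpus = sum(vcpus for _, vcpus, _ in guests)
total_ram_gb = sum(ram for _, _, ram in guests)

print(f"vCPUs allocated: {total_vcpus} on {HOST_CORES} physical cores")
print(f"RAM allocated:   {total_ram_gb} GB on {HOST_RAM_GB} GB physical")

if total_ram_gb > HOST_RAM_GB:
    # Hypervisor overhead comes on top of this figure.
    print(f"RAM is overcommitted by {total_ram_gb - HOST_RAM_GB} GB")
```

Overcommitting is not forbidden as such, but it makes the host rely on ballooning and swapping once the guests actually use what they have been given.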

Storage: local hard drive with thin provisioning.

You don't say anything about your RAID controller (cache, BBWC), your RAID level (1, 10, 5, 6) or your type of hard disk (SCSI, SATA, SAS).

This morning the server crashed again, and I was able to get the following logs from syslog:

May 1 00:48:09 10.0.101.8 Hostd: Ticket issued for CIMOM version 1.0, user root

May 1 00:48:15 10.0.101.8 vmkernel: 10:06:30:59.197 cpu2:6520)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005067900) to NMP device "naa.600508e00000000051c321908952fa08" failed on physical path "vmhba0:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0

It looks like your disk subsystem / RAID controller had a failure or hiccup. Possibly an overload caused by the configuration?
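For anyone who wants to pull these fields out of a larger syslog, here is a rough Python sketch that parses the nmp_CompleteCommandForPath line quoted above. Opcode 0x2a is WRITE(10) in the standard SCSI command set. The host-status table is an assumption recalled from VMware's KB on NMP status codes (where, as far as I remember, H:0x8 means a reset on the host/adapter side), so check it against the KB before drawing conclusions. The function and table names are made up for this example.

```python
import re

# Standard SCSI opcodes; only the common 10-byte read/write are listed here.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# ASSUMPTION: partial host-status table recalled from VMware's KB;
# verify against the official documentation.
HOST_STATUS = {0x0: "OK", 0x8: "RESET"}

NMP_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) \(0x[0-9a-fA-F]+\) to NMP device "([^"]+)" '
    r'failed on physical path "([^"]+)" '
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)'
)

def parse_nmp_failure(line):
    """Return the fields of an NMP command-failure line, or None if no match."""
    m = NMP_RE.search(line)
    if m is None:
        return None
    opcode, device, path, host, dev, plugin = m.groups()
    opcode, host = int(opcode, 16), int(host, 16)
    return {
        "opcode": opcode,
        "command": SCSI_OPCODES.get(opcode, "unknown"),
        "device": device,
        "path": path,
        "host_status": host,
        "host_status_name": HOST_STATUS.get(host, "unknown"),
        "device_status": int(dev, 16),
        "plugin_status": int(plugin, 16),
    }

# The vmkernel line from the post, trimmed to the part after the timestamp.
sample = (
    'cpu2:6520)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005067900) '
    'to NMP device "naa.600508e00000000051c321908952fa08" failed on physical '
    'path "vmhba0:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0'
)
info = parse_nmp_failure(sample)
print(info)
```

Counting how often such lines appear per path and per device before each hang would help tell a one-off glitch from a controller that fails under load.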

Possible steps:

  • Reduce the total RAM allocated to the guests to 14 or 15 GB.

FinFreeTX
Contributor

Did you ever find a resolution to this issue? I'm having the exact same problem and I'm getting nowhere fast trying to troubleshoot...

Host:

  • ESXi 4.1.0 26027

  • Dell PowerEdge 2900

  • Dual quad-core Xeons

  • 16GB RAM

  • Local SATA RAID 5 array

  • 1x 1 Gbit NIC - no VLANs

Guests:

  • 1x Windows Server 2008 x64 - 2 vCPU - 4088MB

  • 1x Windows Server 2008 x32 - 1 vCPU - 2048MB

  • 5x Windows XP 32bit (only 2 used - others are always OFF unless
    needed) - 1 vCPU each - 256MB each

Host BIOS and firmware are at the latest versions, and Dell's self-diagnostics detect no hardware issues.

Please help??? Any suggestions would be greatly appreciated!

DSTAVERT
Immortal

Welcome to the forums.

I would suggest that you create your own post in the ESXi 4 community (this is the ESXi 3.5 community). You will get far more exposure in the right community. Explain your problem in as much detail as you can.

-- David -- VMware Communities Moderator