Re: ESXi 4 : virtual machines crash

Kh3ops · 04-20-2010

Hi everybody,

I'm having a really annoying issue on VMware ESXi 4. I'm running 8 virtual machines with a fairly normal load (CPU, memory, disk).

Almost every week, all virtual machines suddenly become unavailable (they no longer respond to ping). I can still connect to the ESXi host with the vSphere Client, but I cannot reboot the VMs.

The only option is to hard-reboot the server (rebooting from the vSphere Client does not work either).

ESXi is running on a Dell PowerEdge chassis (dual Xeon, 16 GB RAM).

Any idea where to look?

Regards,

Gaëtan

Jackobli · 04-20-2010

Hello and welcome to the VMware ESXi community forum.

I'm having a really annoying issue on VMware ESXi 4. I'm running 8 virtual machines with a fairly normal load (CPU, memory, disk).
Almost every week, all virtual machines suddenly become unavailable (they no longer respond to ping). I can still connect to the ESXi host with the vSphere Client, but I cannot reboot the VMs.
ESXi is running on a Dell PowerEdge chassis (dual Xeon, 16 GB RAM).
Any idea where to look?

Everywhere

C'mon, tell us more:

- what kind of virtual machines (OS, one or more vCPUs, how much memory)

- what kind of network (number of NICs, which switch they are connected to, VLANs or not)

- what kind of storage system (local, NFS, SAN, RAID level)

Is this a supported server (on the HCL)?

Have you checked for newer BIOS and firmware versions?

Kh3ops · 05-01-2010

Hi,

I'm running 8 VMs on this server:

1°) Gentoo OS : 1 CPU, 2GB RAM

2°) Gentoo OS : 2 CPU, 2GB RAM

3°) Gentoo OS : 4 CPU, 2GB RAM

4°) FreeBSD : 4 CPU, 2GB RAM

5°) Gentoo OS : 2 CPU, 3GB RAM

6°) Gentoo OS : 2 CPU, 3GB RAM

7°) Gentoo OS : 1 CPU, 2GB RAM

8°) Gentoo OS : 1 CPU, 2GB RAM

There is only one NIC per VM, attached to the main network (no VLAN).

Storage: local hard drive with thin provisioning.

This morning the server crashed again, and I was able to get the following logs from syslog:

May 1 00:48:04 10.0.101.8 Hostd: Activation : Invoke done on

May 1 00:48:04 10.0.101.8 Hostd: Arg version:

"48"

May 1 00:48:04 10.0.101.8 Hostd: Throw vmodl.fault.RequestCanceled

May 1 00:48:04 10.0.101.8 Hostd: Result:

(vmodl.fault.RequestCanceled) {

dynamicType = <unset>,

faultCause = (vmodl.MethodFault) null,

msg = "",

}

May 1 00:48:04 10.0.101.8 Hostd: PendingRequest: HTTP Transaction failed, closing connection: N7Vmacore15SystemExceptionE(Connection reset by peer)

May 1 00:48:09 10.0.101.8 Hostd: Ticket issued for CIMOM version 1.0, user root

May 1 00:48:15 10.0.101.8 vmkernel: 10:06:30:59.197 cpu2:6520)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005067900) to NMP device "naa.600508e00000000051c321908952fa08" failed on physical path "vmhba0:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0

That was the last message before I rebooted the server.

Thanks for your help,

Gaëtan

Jackobli · 05-01-2010

I'm running 8 VMs on this server :
1°) Gentoo OS : 1 CPU, 2GB RAM
2°) Gentoo OS : 2 CPU, 2GB RAM
3°) Gentoo OS : 4 CPU, 2GB RAM
4°) FreeBSD : 4 CPU, 2GB RAM
5°) Gentoo OS : 2 CPU, 3GB RAM
6°) Gentoo OS : 2 CPU, 3GB RAM
7°) Gentoo OS : 1 CPU, 2GB RAM
8°) Gentoo OS : 1 CPU, 2GB RAM

You have a dual Xeon; assuming quad-core CPUs, that gives you 8 cores in total. But the third and fourth VMs alone already claim all available cores between them. This configuration is not really recommended. Why so many vCPUs?

You are overcommitting RAM too: there is 16 GB physical, your guests are allocated 18 GB, and the hypervisor needs some as well.
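
To make the arithmetic explicit, here is a minimal Python sketch that adds up the allocations from the list above. The 8-core figure is my assumption (dual quad-core Xeon), and the guest labels are only for illustration.

```python
# Guest list as posted above: (label, vCPUs, RAM in GB).
# The labels are illustrative, not the real VM names.
guests = [
    ("Gentoo 1", 1, 2),
    ("Gentoo 2", 2, 2),
    ("Gentoo 3", 4, 2),
    ("FreeBSD 4", 4, 2),
    ("Gentoo 5", 2, 3),
    ("Gentoo 6", 2, 3),
    ("Gentoo 7", 1, 2),
    ("Gentoo 8", 1, 2),
]

PHYSICAL_CORES = 8    # assumption: dual quad-core Xeon
PHYSICAL_RAM_GB = 16  # as stated in the first post

total_vcpus = sum(vcpus for _, vcpus, _ in guests)
total_ram_gb = sum(ram for _, _, ram in guests)

print(f"vCPUs allocated: {total_vcpus} on {PHYSICAL_CORES} cores "
      f"({total_vcpus / PHYSICAL_CORES:.2f} vCPUs per core)")
print(f"RAM allocated:   {total_ram_gb} GB on {PHYSICAL_RAM_GB} GB physical "
      f"({total_ram_gb - PHYSICAL_RAM_GB:+d} GB before hypervisor overhead)")
```

Overcommitment by itself is something ESXi is designed to handle, so treat these totals as a hint about where pressure may come from, not as proof of the cause.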

Storage: local hard drive with thin provisioning.

You don't say anything about your RAID controller (cache, BBWC), your RAID level (1, 10, 5, 6) or your type of hard disk (SCSI, SATA, SAS).

This morning, the server crashed again and I was able to get the following logs from syslog :
May 1 00:48:09 10.0.101.8 Hostd: Ticket issued for CIMOM version 1.0, user root
May 1 00:48:15 10.0.101.8 vmkernel: 10:06:30:59.197 cpu2:6520)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005067900) to NMP device "naa.600508e00000000051c321908952fa08" failed on physical path "vmhba0:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0

Looks like your disk subsystem / RAID controller had a failure or a hiccup. Possibly an overload due to the configuration?
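
If you want to watch for this kind of message in your remote syslog, here is a rough Python sketch that pulls the interesting fields out of the vmkernel line quoted above. The regular expression is only fitted to that one line, and the two small lookup tables are deliberately incomplete: they reflect my understanding that SCSI opcode 0x2a is WRITE(10) and that host status 0x8 is reported as a reset, so check them against VMware's documentation before relying on them.

```python
import re

# The vmkernel line from the post above, copied verbatim.
LINE = ('May 1 00:48:15 10.0.101.8 vmkernel: 10:06:30:59.197 cpu2:6520)NMP: '
        'nmp_CompleteCommandForPath: Command 0x2a (0x410005067900) to NMP device '
        '"naa.600508e00000000051c321908952fa08" failed on physical path '
        '"vmhba0:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0')

# Pattern fitted to this one message format (an assumption, not an official grammar).
PATTERN = re.compile(
    r'Command (?P<opcode>0x[0-9a-fA-F]+) .*?'
    r'NMP device "(?P<device>[^"]+)" failed on physical path "(?P<path>[^"]+)" '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) P:(?P<plugin>0x[0-9a-fA-F]+)'
)

# Partial lookup tables, to the best of my knowledge; extend as needed.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
HOST_STATUS = {0x0: "OK", 0x8: "RESET"}


def parse_nmp_failure(line):
    """Return the fields of an NMP path-failure line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "opcode": int(m["opcode"], 16),
        "device": m["device"],
        "path": m["path"],
        "host": int(m["host"], 16),
        "dev": int(m["dev"], 16),
        "plugin": int(m["plugin"], 16),
    }


info = parse_nmp_failure(LINE)
print(info["path"],
      SCSI_OPCODES.get(info["opcode"], "unknown opcode"),
      "host status:", HOST_STATUS.get(info["host"], hex(info["host"])))
```

Here the device and plugin status are both zero and only the host status is non-zero, which is why I suspect the controller or its driver and not the guests.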

Possible steps:

Reduce the allocated RAM to a total of 14 or 15 GB.

FinFreeTX · 11-05-2010

Did you ever find a resolution to this issue? I'm having the exact same problem and I'm getting nowhere fast trying to troubleshoot...

Host:

ESXi 4.1.0 26027
Dell PowerEdge 2900
Dual quad-core Xeons
16GB RAM
Local SATA RAID 5 array
1x 1 Gb NIC - no VLANs

Guests:

1x Windows Server 2008 64-bit - 2 vCPU - 4088MB
1x Windows Server 2008 32-bit - 1 vCPU - 2048MB
5x Windows XP 32-bit (only 2 used - others are always OFF unless needed) - 1 vCPU each - 256MB each

Host BIOS and firmware are on the latest versions, and Dell's self-diagnostics detected no hardware issues.

Please help! Any suggestions would be greatly appreciated!

DSTAVERT · 11-08-2010

Welcome to the forums.

I would suggest that you create your own post in the ESXi 4 community (this is the ESXi 3.5 community). You will get far more exposure in the right community. Explain your problem in as much detail as you can.

-- David -- VMware Communities Moderator
