Hello all,
We are running six ESX hosts (HP ProLiant DL580 G5) in two clusters. Each cluster has three ESX 3.0.2 hosts (build 63195) with HA enabled and DRS enabled in fully automated mode.
The problem is that VMs (all W2K3 Standard) randomly become unresponsive. When the last VM became unresponsive, I found these records in /var/log/vmware/hostd.log:
Ticket issued for mks connections to user: vpxuser
Current value 169812 exceeds soft limit 122880.
Ticket issued for mks connections to user: vpxuser
Current value 169812 exceeds soft limit 122880.
Propagating stats from interval 20 to 300
Does anyone have any idea what may be causing these systems to become unresponsive? Or, if the problem is more complex, can you help me figure out why the last VM became unresponsive?
Thanks for the help.
What is the configuration of your ESX servers (memory, processors, storage, etc.)? How many VMs are you running? And what is the configuration of the VMs (number of virtual CPUs, memory)?
You might consider increasing your Service Console memory to see if that helps with your issue.
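One way to see whether those soft-limit warnings keep coming back after the change is simply to filter hostd.log for them. A minimal sketch (on a real host you would point grep at /var/log/vmware/hostd.log; the sample file below just stands in so the commands run anywhere):

```shell
# Sketch: count the soft-limit warnings in hostd.log.
# The sample file mirrors the messages quoted above; on the host,
# replace /tmp/hostd_sample.log with /var/log/vmware/hostd.log.
cat > /tmp/hostd_sample.log <<'EOF'
Ticket issued for mks connections to user: vpxuser
Current value 169812 exceeds soft limit 122880.
Propagating stats from interval 20 to 300
EOF
grep -c "soft limit" /tmp/hostd_sample.log
# In the Service Console, "free -m" shows how much of the console's
# RAM is actually in use before and after raising it.
```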
OK, here it is. All ESX servers are identical (HP ProLiant DL580 G5) and configured as follows:
1. Memory: Total 36861.3 MB / System 2799.3 MB / Virtual Machines 33790 MB / Service Console 272 MB
2. CPU: Intel Xeon, 2.1 GHz / 4 sockets / 4 cores per socket / 16 logical CPUs / Hyperthreading disabled
3. Storage adapter: QLA2432
4. Storage: Fibre Channel SAN
Each ESX hosts 3 VMs (W2K3 Standard), each configured with 2 vCPUs / 4096 MB RAM / 144.32 MB overhead / VMware Tools installed / a separate 752 GB LUN per VM.
Thanks, guys. I'll have a look at it and post here whether it helped.
I increased the console memory as suggested and, purely for troubleshooting purposes, also deactivated DRS. The messages are gone now, but this morning another VM became unresponsive.
So I checked the logs; here is what I found in /var/log/vmkernel for that machine. I'm now going to check the other logs as well.
Jan 26 11:35:32 swp-esx0005 vmkernel: 4:20:00:24.409 cpu10:1073)World: vm 1073: 3864: Killing self with status=0x0:Success
Jan 26 11:35:32 swp-esx0005 vmkernel: 4:20:00:24.409 cpu11:1074)World: vm 1074: 3864: Killing self with status=0x0:Success
Jan 26 11:35:32 swp-esx0005 vmkernel: 4:20:00:24.409 cpu8:1088)World: vm 1088: 3864: Killing self with status=0x0:Success
Jan 26 11:35:32 swp-esx0005 vmkernel: 4:20:00:24.409 cpu8:1087)World: vm 1087: 3864: Killing self with status=0x0:Success
Jan 26 11:35:32 swp-esx0005 vmkernel: 4:20:00:24.427 cpu13:1086)World: vm 1086: 3864: Killing self with status=0x0:Success
Jan 26 11:35:32 swp-esx0005 vmkernel: 4:20:00:24.453 cpu5:1091)World: vm 1091: 3864: Killing self with status=0x0:Success
Jan 26 11:35:32 swp-esx0005 vmkernel: 4:20:00:24.453 cpu6:1090)World: vm 1090: 3864: Killing self with status=0x0:Success
Jan 26 11:35:36 swp-esx0005 vmkernel: 4:20:00:28.335 cpu5:1071)World: vm 1114: 690: Starting world vmm0:SWP-VMMU004 with flags 8
Jan 26 11:35:36 swp-esx0005 vmkernel: 4:20:00:28.336 cpu5:1071)Sched: vm 1114: 4836: adding 'vmm0:SWP-VMMU004': group 'host/user': cpu: shares=-3 min=0 max=-1
Jan 26 11:35:36 swp-esx0005 vmkernel: 4:20:00:28.336 cpu5:1071)Sched: vm 1114: 4849: renamed group 14 to vm.1071
Jan 26 11:35:36 swp-esx0005 vmkernel: 4:20:00:28.336 cpu5:1071)Sched: vm 1114: 4863: moved group 14 to be under group 4
Jan 26 11:35:36 swp-esx0005 vmkernel: 4:20:00:28.342 cpu5:1071)Swap: vm 1114: 1426: extending swap to 4194304 KB
Jan 26 11:35:36 swp-esx0005 vmkernel: 4:20:00:28.350 cpu5:1071)World: vm 1115: 690: Starting world vmm1:SWP-VMMU004 with flags 8
What else should I do?
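For anyone comparing notes, a rough way to pull the "Killing self" entries out of the vmkernel log around the event time (assuming the standard ESX 3.x log location; the sample file below just makes the commands runnable anywhere):

```shell
# Sketch: filter vmkernel for worlds that killed themselves.
# The sample mirrors the entries quoted above; on the host you would run
# something like: grep "Killing self" /var/log/vmkernel
cat > /tmp/vmkernel_sample.log <<'EOF'
Jan 26 11:35:32 swp-esx0005 vmkernel: World: vm 1073: Killing self with status=0x0:Success
Jan 26 11:35:36 swp-esx0005 vmkernel: World: vm 1114: Starting world vmm0:SWP-VMMU004 with flags 8
EOF
grep -c "Killing self" /tmp/vmkernel_sample.log
# Narrowing by timestamp (e.g. grep "Jan 26 11:35" ...) then comparing
# against hostd.log for the same minute helps correlate the two logs.
```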
Also, please have a look at all these messages from hostd:
Task Created : haTask-480-vim.VirtualMachine.reset-456
Registered Foundry callback on 2
Adding task: haTask-480-vim.VirtualMachine.reset-456
VM State transition requested to VM_STATE_RESETTING
Event 6 : SWP-VMMU004 on swp-esx0005.deutschepost.dpwn.com in ha-datacenter is reset
State Transition (VM_STATE_ON -> VM_STATE_RESETTING)
VM State transition post act for VM_STATE_RESETTING
CheckLicenses: Checking licenses based on VM features used.
GetHardwareCapability: Checking VM Feature Utilization.
GetHardwareCapability: VM Capability used : san
GetHardwareCapability: VM Capability used : vsmp
Tracking progress for method : vim.VirtualMachine.reset
Retrieved current power state from foundry 1
Updated VM state machine with new power state: 1
DISKLIB-VMFS : "/vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004-flat.vmdk" : open successful (17) size = 17179869184, hd = -1. Type 3
DISKLIB-VMFS : "/vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004-flat.vmdk" : closed.
Time to gather config: 20 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966136898465 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Tools are not operations ready
Time to gather config: 17 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966136943148 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Retrieved current power state from foundry 1
Disconnect check in progress: /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx
Time to gather config: 17 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966136993930 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Retrieved current power state from foundry 1
Updated VM state machine with new power state: 1
Retrieved current power state from foundry 1
VM State transition requested to VM_STATE_ON
Event 7 : SWP-VMMU004 on swp-esx0005.deutschepost.dpwn.com in ha-datacenter is powered on
State Transition (VM_STATE_RESETTING -> VM_STATE_ON)
VM State transition post act for VM_STATE_ON
Updating current power state: 1
Posting vmevent to '/vm/runtime/powerop/':
timestamp = 1232966137045560 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx oldPowerState = 1 newPowerState = 1
UpdateOverhead: is64Bit = false
Task Completed : haTask-480-vim.VirtualMachine.reset-456
Removing task: haTask-480-vim.VirtualMachine.reset-456
Received state change for VM '480'
Retrieved current power state from foundry 1
Adding vm 480 to poweredOnVms list
Retrieved current power state from foundry 1
Time to gather config: 18 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966137115489 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Time to gather config: 26 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966137164131 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Time to gather config: 19 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966137205985 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Retrieved current power state from foundry 1
Time to gather config: 17 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966137246111 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Retrieved current power state from foundry 1
Retrieved current power state from foundry 1
Retrieved current power state from foundry 1
Running status of tools changed to: notRunning
Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-458
Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-458
Time to gather config: 18 (msecs)
Posting vmevent to '/vm/reconfig/':
timestamp = 1232966137344113 VM ID = 480 VM cfg Path = /vmfs/volumes/47d0e4b4-b44691c8-d099-001cc4934778/SWP-VMMU004/SWP-VMMU004.vmx: reconfigure
Sending notification failed to receiver post. Status: 1, Command output: snmptrap: Unknown host (Resource temporarily unavailable)
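As a side note, the last line ("snmptrap: Unknown host") suggests the Service Console cannot resolve the SNMP trap receiver's hostname. That is probably a separate issue from the VM hangs, but worth ruling out. A quick, hedged sanity check (the actual receiver name depends on your SNMP configuration):

```shell
# Sketch: rule out basic name-resolution problems in the Service Console.
# First confirm the resolver works at all; then, on the host, you would
# look up the actual trap receiver's name from your SNMP configuration.
getent hosts localhost >/dev/null && echo "resolver OK"
# e.g. on the host: nslookup <your-trap-receiver>
```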