VMware Cloud Community
g1xx3rb0y
Contributor

Weird ESX 3.5i VM unresponsive issue

Hi Everyone,

I have a very weird issue on two HP DL380 G5 boxes, each running ESX 3.5i U3.

The setup:

DL380 G5

2x quad-core CPUs

2x 72 GB in RAID 1+0

6x 146 GB in RAID 5

32 GB RAM

I installed ESX 3.5i U3 as normal and created VMs. But when I copy data from a network folder to a VM, the CPU spikes and the VM becomes unresponsive. Worse, if I copy a large file from one folder on C: to another folder on C:, the VM becomes unresponsive: CPU is high and ping replies take about 200 ms. The OS VMDKs are all on the 72 GB RAID 1+0 datastore. When I add a data drive from the RAID 5 datastore, the same happens: if I copy from C: to that drive, the VM becomes unresponsive and pings hover at about 200 ms.

Any ideas?

I have no clue why this is happening. I've deployed the same setup elsewhere with a bunch of VMs and had no problems.


Accepted Solutions
avlieshout
VMware Employee

I don't think it's network related, since you said you have the same problem copying from C: to C: on the same machine.

It looks like the machine is waiting for I/O to complete. Check whether I/O is stalling somewhere.

Are there any messages in the Windows system log indicating disk problems?

Check your RAID controller. Is it on the HCL?

Is your write cache configured properly?
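
One way to check for stalled I/O from inside the guest is to time a copy chunk by chunk and look at the slowest chunk. The following is a minimal sketch, not a tool mentioned in this thread: `timed_copy`, the 1 MiB chunk size, and the 8 MiB scratch file are all assumptions chosen for illustration. It forces each chunk to disk with `os.fsync`, so a long pause on a single chunk points at the storage path rather than the network.

```python
import os
import tempfile
import time


def timed_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks.

    Returns (total_bytes, elapsed_seconds, slowest_chunk_seconds).
    """
    total = 0
    slowest = 0.0
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            t0 = time.perf_counter()
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            fout.flush()
            os.fsync(fout.fileno())  # ask the OS to push this chunk to disk
            slowest = max(slowest, time.perf_counter() - t0)
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed, slowest


if __name__ == "__main__":
    # Hypothetical scratch file; point src at a real large file to test a drive.
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.bin")
        dst = os.path.join(d, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(8 * 1024 * 1024))
        total, elapsed, slowest = timed_copy(src, dst)
        print(f"{total / 2**20:.1f} MiB in {elapsed:.3f} s, "
              f"slowest chunk {slowest * 1000:.1f} ms")
```

Running it against a larger file on the affected drive and comparing the slowest-chunk time with a healthy VM would give a rough comparison. The numbers depend heavily on caching, so treat them as indicative only.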

-Arnim van Lieshout

Blogging: http://www.van-lieshout.com
Twitter: http://www.twitter.com/avlieshout

If you find this information useful, please award points for "correct" or "helpful".

29 Replies
christianZ
Champion

Are VMware Tools installed?

I would try copying one VMDK file in the VI Client datastore browser (with the VM powered off). What disk speed do you see?

cmacmillan
Hot Shot

What is your vSwitch configuration? More to the point, does it match your network switch configuration? I'd also check for IP address conflicts (possibly hidden) since it's a new installation...

--Collin C. MacMillan
SOLORI - Solution Oriented, LLC
VCP4/VCP5, VCAP-DCD4, Cisco CCNA/CCNP, Nexenta CNE, VMware vExpert 2010-2012
http://blog.solori.net
If you find this information useful, please award points for "correct" or "helpful".

g1xx3rb0y
Contributor

Hi Collin,

The vSwitch config is simple: a management network with one pNIC connected and a VM network with one pNIC connected. What I find strange is that even if I just copy from one folder to another within the VM, the VM becomes unresponsive and the CPU spikes.

I copied 2 GB of files from one datastore to another and it took about 3 minutes.
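
For a rough sense of scale, the datastore copy figure can be turned into an average rate. This is a minimal sketch under stated assumptions: 2 GB is taken as 2048 MiB, "about 3 minutes" as exactly 180 seconds, and the function name is made up for illustration.

```python
def throughput_mib_per_s(size_mib: float, seconds: float) -> float:
    """Average throughput in MiB/s for a copy of size_mib that took seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mib / seconds


if __name__ == "__main__":
    # Assumed inputs: 2 GB treated as 2048 MiB, 3 minutes as 180 s.
    rate = throughput_mib_per_s(2 * 1024, 3 * 60)
    print(f"{rate:.1f} MiB/s")  # prints "11.4 MiB/s"
```

Under those assumptions the copy averaged a little over 11 MiB/s. Whether that is slow for these arrays is a judgment call; nothing in the thread gives a baseline to compare against.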

Sanjana
Hot Shot

Hi g1xx3rb0y,

Welcome to the forums. Which guest OS (Linux/Windows) do these VMs run? And what is the virtual SCSI controller type (LSI Logic/BusLogic) in the VM?

--sanjana

g1xx3rb0y
Contributor

Hi Sanjana,

The VMs are all Windows 2003 R2 SP2. I've got both x64 and x86 VMs running. I'm currently copying 3 GB of data within one of the x64 VMs to another disk on a different datastore, and the CPU is at 100%.

Controller: LSI Logic

kjb007
Immortal

Are other VMs on this host exhibiting similar behavior? I would reinstall VMware Tools.

-KjB
VMware vExpert 2009
vExpert/VCP/VCAP, vmwise.com / @vmwise

g1xx3rb0y
Contributor

Two different ESX hosts, same result on all the VMs.

g1xx3rb0y
Contributor

See the screenshot

kjb007
Immortal

Which NIC type are you using, Flexible or Enhanced? Did you install VMware Tools during creation/migration or after the VM was up and running? I would reinstall the tools before troubleshooting anything else. Which process is pegging your CPU at 100%?

-KjB

g1xx3rb0y
Contributor

NIC type on the x64 VM: e1000. On the x86 VM: flexible.

I've reinstalled VMware Tools on the VM, with the same result.

Processes hogging the CPU: SYSTEM and explorer.exe.

The VMs are vanilla: no AV, nothing, just the latest patches.

Sanjana
Hot Shot

Just some more troubleshooting: how much CPU/RAM are these VMs configured with? And do they have any reservations set?

g1xx3rb0y
Contributor

Hi Sanjana,

I have no issue with troubleshooting. I am grasping at straws here; I haven't seen this before, and I've deployed a couple of VM solutions.

Anyway, the VMs are configured as follows:

Starting off with a single vCPU.

I assigned my x64 server 4096 MB RAM (will probably go up to 6 GB). It is an Exchange 2007 mailbox server.

I haven't set any reservations whatsoever; these are the default settings after VM creation.

The host is running 4 VMs now. The other 3 VMs each run 1 vCPU with 512 MB RAM.

kjb007
Immortal

I'd use the enhanced vmxnet driver instead of the e1000. It's a more tuned driver that may help in overall performance as well.

-KjB

g1xx3rb0y
Contributor

Hi KjB,

I've set the network adapter type to enhanced; still the same problem. In the meantime I have also reset the other ESX server's BIOS to defaults and enabled only VT, as a long shot. I tested the one x86 VM and it looks much better, with CPU hovering at about 13%. Unfortunately I cannot do the same for the other host, as the iLO isn't working remotely, but I am building another VM on the host now to see whether the BIOS defaults fixed the issue. I will then compare the settings on the two hosts and see which one is different. I am holding thumbs now!

g1xx3rb0y
Contributor

Spoke too soon... it seems to be intermittent now.

cmacmillan
Hot Shot

What's going on in /var/log/messages while this is happening? Sounds like unsupported/uninitialized hardware blocking an I/O process. Try the same with a totally internal vSwitch for comparison (not married to pNIC)...

--Collin C. MacMillan

g1xx3rb0y
Contributor

Hi guys,

We've tested the vSwitch theory: we completely removed the vNIC from one of the VMs and managed it only through the console. The same problem occurs when copying from C: to C:. We have, however, noticed that the vmkernel messages show warnings about the CPUs; I have attached a screenshot. The CPUs are 2x Intel Xeon quad-core E5440. I know they are supported, as I've deployed ESX 3.5i on an ML350 G5 that runs these models.

The RAID controller is an LSI SAS 1078-C2; as far as I could see, it is supported. We are going to try a firmware update for the controller.

g1xx3rb0y
Contributor

Hi Everyone,

I "think" the issue has been resolved. We changed everything on one server (CPU, memory, RAID controller, but not the disks) and still had the same issue. We also saw that the RAID controllers on both servers did not have the black battery attached. So we updated the BIOS revision and attached some spare batteries on both servers, and we haven't had an issue since. Copies are fine. The CPU does spike every now and then, but not to 100%; it hovers between 13% and 30% on very large copies (over 5 GB), and the VMs aren't unresponsive anymore.

Personally, it doesn't make a lot of sense. I am still monitoring the situation.

Will provide feedback later.
