VMware Cloud Community
RogerAli
Contributor

Poor Ping Time

Hey all,

We've got two identical ESX Servers with the following configuration:

Dell PowerEdge 2950
2x 2.6 GHz dual-core Core 2 CPUs
16 GB RAM
4x Gigabit NICs (2 Intel, 2 Broadcom)
Emulex HBA to an EMC CLARiiON 600 SAN
ESX Server 3.0.1

We've got about 10-12 VMs running in this ESX farm. Recently we've been getting complaints of poor performance when working on the servers (ePO, Terminal Server, IIS, etc.).

Concerned about this, we checked our hardware (cabling, switches, etc.). Nothing came up, so I tried simple ping tests from (1) a V2V on the same farm and (2) a P2V. The results are as follows (a timestamped way to capture these is sketched after the results):

(1) V2V

Reply from 131.247.89.169: bytes=32 time=16ms TTL=128
Reply from 131.247.89.169: bytes=32 time<10ms TTL=128
Reply from 131.247.89.169: bytes=32 time=35ms TTL=128
Reply from 131.247.89.169: bytes=32 time=15ms TTL=128
Reply from 131.247.89.169: bytes=32 time<10ms TTL=128
Reply from 131.247.89.169: bytes=32 time=9ms TTL=128
Reply from 131.247.89.169: bytes=32 time=19ms TTL=128
Reply from 131.247.89.169: bytes=32 time=14ms TTL=128
Reply from 131.247.89.169: bytes=32 time<10ms TTL=128
Reply from 131.247.89.169: bytes=32 time<10ms TTL=128

(2) P2V

Reply from 131.247.89.53: bytes=32 time=5ms TTL=127
Reply from 131.247.89.53: bytes=32 time=1ms TTL=127
Reply from 131.247.89.53: bytes=32 time=3ms TTL=127
Reply from 131.247.89.53: bytes=32 time=1ms TTL=127
Reply from 131.247.89.53: bytes=32 time<1ms TTL=127
Reply from 131.247.89.53: bytes=32 time=53ms TTL=127
Reply from 131.247.89.53: bytes=32 time<1ms TTL=127
Reply from 131.247.89.53: bytes=32 time=3ms TTL=127
Reply from 131.247.89.53: bytes=32 time=18ms TTL=127
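
(A way to capture these pings with timestamps, so the latency spikes can be lined up against other events, would be a small loop like the rough sketch below, run from any Linux box. The IP is just the V2V target from above.)

# one timestamped ping per second against the V2V target
while true; do
  echo -n "$(date '+%H:%M:%S')  "
  ping -c 1 -w 2 131.247.89.169 | grep 'time='
  sleep 1
done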

The VMs are using the VMware Accelerated AMD PCNet Adapter and are either 2K or 2K3 (both versions of the OS have the issue). I am assuming this adapter is the vmxnet adapter that is supposedly more robust.

Reading through the forums I also found that 2-vCPU VMs might lead to performance issues, so I've limited those to only the VMs that require 2 CPUs. I read that running esxtop at the console will give me some information. I have, and the results are:

5:13:34pm up 22 days, 53 min, 73 worlds; CPU load average: 0.17, 0.16, 0.17
PCPU(%): 51.95, 62.02, 18.91, 42.43 ; used total: 43.83
CCPU(%): 0 us, 33 sy, 67 id, 0 wa ; cs/sec: 633

ID GID NAME NMEM %USED %SYS %OVRLP %RUN %WAIT
1 1 idle 4 209.53 0.00 0.00 0.69 0.00
2 2 system 5 0.00 0.00 0.00 0.00 458.15
6 6 console 1 29.14 0.00 0.02 29.14 46.95
7 7 helper 13 0.31 0.00 0.00 0.31 1190.89
8 8 drivers 7 0.00 0.00 0.00 0.00 641.73
9 9 vmotion 1 0.00 0.00 0.00 0.00 91.68
15 15 vmware-vmkauthd 1 0.00 0.00 0.00 0.00 91.69
93 93 ITTS2 7 42.80 0.00 0.07 44.27 502.28
97 97 Rainier 5 72.54 0.11 0.20 71.61 357.15
105 105 Pilot 5 1.30 0.00 0.01 1.30 436.28
107 107 ProjectDev 5 1.35 0.00 0.01 1.35 441.09
113 113 Wolverine 7 1.89 0.00 0.00 1.89 621.70
117 117 PanaceaDev 7 1.94 0.00 0.01 1.94 619.99
127 127 Shugo 5 4.24 0.00 0.02 4.24 408.19

I keep reading that the %Ready value will help me understand whether I have issues with my configuration, but I am unable to find that field to compare/analyze my VMs.

Please help me out with these issues; I'm trying to get this configured correctly before we add another 2950 to the mix.

Roger

7 Replies
RParker
Immortal

You've got the same problem we have. You have a Core 2 processor, but it's not DUAL CORE. We have dual processors, and we try to maintain a 4-VM-per-core ratio. You are approaching that yourself.

That's where the knee bend is occurring: I/O. It's not CPU, memory, or even network, it's disk. Your disk is frankly being spread thin across ALL the VMs.

The magic question: What to do about it.

A: There isn't much you can do. This is the biggest downfall of VMs: I/O performance takes a hit because you ONLY have one RAID set shared across ALL the VMs. The only way to improve this is to put your VMs on a SAN LUN. That's what we are exploring; you will get much better performance, but at a price.

For now, there isn't much you can do: prioritize the VMs and try to figure out which ones get higher shares. I struggle with this every single day.

It's a chore, believe me. I keep trying to figure out a solution, but there is only so much you can do with 2 processors. Quad core is another story, but you still have to contend with a single RAID controller. That's the bottleneck.

RogerAli
Contributor

RParker,

The Core 2 CPUs are actually 2 cores on one socket each. The 2950 has 2 sockets, equating to 4 cores per system. Regardless, we only have about 5-6 VMs per server (two 2950s in our farm). We also have a SAN backbone to our ESX servers using Emulex HBAs.

Our I/O isn't getting much traffic at all from the SAN's perspective. The same holds true on our LAN side: from the switches, our network team can't see any collisions, errors, etc.

Roger

RParker
Immortal

Oh, OK. Well, I hate to admit it, but since we have similar hardware from the same manufacturer, and other people on here haven't had these types of issues, maybe the Dell machines have some incompatibility with ESX.

All of our servers are Dell, and I have tried everything: rebuilding, reinstalling, CPU affinity, moving VMs around, trying different configurations with SCSI (some are RAID, some are plain SCSI), and I get the same result.

I have thrown my hands up. I spent 2 months working 12-14 hour days, just before Christmas, taking advantage of people being on vacation, and I can't get it to clear up. We have been trying VI3, same result; the I/O is the only thing giving us grief.

Other people have mentioned that the PERC cards in these machines are glorified Adaptec controllers and won't give the best performance. I know you use a SAN, so the only thing left is the infrastructure, which is Dell. We haven't gotten our quad-core 2950 yet, so I can't test my theory... but it sure seems like these PE machines aren't giving us the return we were looking for.

mstahl75
Virtuoso

"I keep reading that the %Ready value will help me understand whether I have issues with my configuration, but I am unable to find that field to compare/analyze my VMs."

%RDY is the 15th column in the esxtop output. That output doesn't wrap, if I recall correctly, so if your SSH application doesn't allow you to stretch your window, you won't see it by default.

There are some options in esxtop (type ? when it is running) to reorder and/or add/remove entries.
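
If you can't stretch the window, another option is esxtop's batch mode, which writes every column for every sample to a file you can dig through afterwards. A rough sketch (adjust the delay and sample count to taste):

# capture 60 samples, 5 seconds apart, into a CSV
esxtop -b -d 5 -n 60 > /tmp/esxtop-capture.csv

The batch output is a perfmon-style CSV, so it can also be pulled into a spreadsheet if that is easier than reading it on the console.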

christianZ
Champion

Your console is a little busy, or am I wrong?

Have you checked the NIC auto-sense/auto-negotiation settings (/var/log/vmkernel)?

You can also run a few Iometer tests in your VMs.
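
For the NIC check, something along these lines from the service console should show the negotiated speed/duplex and any link messages in the log (a rough sketch):

# list the physical NICs with their negotiated speed and duplex
esxcfg-nics -l
# look for link/duplex messages in the vmkernel log
grep -i -e link -e duplex /var/log/vmkernel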

RogerAli
Contributor

Mstahl75,

I apologize for my delay. I finally got the chance to get proper readouts from esxtop. I want to say that 90% of the time all our servers show a %RDY under 2.00 (most under 1.00). The concern is that rather frequently (1-3 times every 5 minutes) we get a spike on some servers. I was able to catch this spike in my PuTTY session, and it's posted below:

9:45:17am up 17 days, 7:08, 57 worlds; CPU load average: 0.07, 0.07, 0.07
PCPU(%): 6.32, 8.29, 2.87, 4.82 ; used total: 5.58
CCPU(%): 0 us, 1 sy, 99 id, 0 wa ; cs/sec: 132

ID GID NAME NMEM %USED %SYS %OVRLP %RUN %WAIT %BWAIT %TWAIT %CRUN %CSTP %IDLE %RDY %EXTRA %MLMTD
1 1 idle 4 366.97 0.00 0.00 1.80 0.00 0.00 0.00 0.00 0.00 0.00 387.25 0.00 0.00
102 102 Clio2 5 12.99 0.00 0.04 13.00 342.32 50.24 392.56 0.00 0.00 40.83 80.73 10.14 79.56
103 103 Rainier 5 2.02 0.00 0.08 2.03 386.50 96.69 483.18 0.00 0.00 94.29 1.08 0.46 0.00
104 104 ITTS2 5 2.92 0.00 0.04 2.93 423.92 58.80 482.72 0.00 0.00 93.90 0.63 1.31 0.00
108 108 Civic 5 1.58 0.02 0.03 1.56 442.98 41.68 484.65 0.00 0.00 95.96 0.09 1.13 0.00
105 105 CMSDev 5 0.99 0.01 0.03 0.98 403.50 81.79 485.29 0.00 0.00 96.06 0.03 0.59 0.00
6 6 console 1 1.82 0.00 0.01 1.85 59.73 35.66 95.39 0.00 0.00 95.38 0.02 0.79 0.00
7 7 helper 13 0.02 0.00 0.00 0.02 1264.27 0.00 1264.27 0.00 0.00 0.00 0.01 0.00 0.00
8 8 drivers 7 0.00 0.00 0.00 0.00 680.84 0.00 680.84 0.00 0.00 0.00 0.00 0.00 0.00
2 2 system 5 0.00 0.00 0.00 0.00 486.34 0.00 486.34 0.00 0.00 0.00 0.00 0.00 0.00
9 9 vmotion 1 0.00 0.00 0.00 0.00 97.26 0.00 97.26 0.00 0.00 0.00 0.00 0.00 0.00
21 21 vmware-vmkauthd 1 0.00 0.00 0.00 0.00 97.26 0.00 97.26 0.00 0.00 0.00 0.00 0.00 0.00

The spike in %RDY usually lasts for 1-2 minutes and then evens out, and the server drops back down to under 2.00. During these spikes I've also noticed we get the poor ping responses and the complaints of server slowdown. I think I read that a %RDY over 10.00 is bad; is this correct?

My machines are in resource pools that only use a limit (no shares, no reservations). Can you shed some light on this issue? These two 2950s in our cluster were suggested by VMware resellers using Capacity Planner to handle 19 servers (we only used the candidates they suggested for VMs). We've got 12 on there now, and I'm afraid of adding more to the mix due to the performance issues. What do you think?
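
In the meantime, would it make sense to pull the %RDY numbers for one VM out of an esxtop batch capture, so I can watch the spikes over time instead of trying to catch them live in PuTTY? Something like the rough sketch below is what I had in mind (the exact counter names in the capture header may differ on 3.0.1, so the first line of the CSV would need checking):

# capture roughly 10 minutes of samples, 5 seconds apart, with every column included
esxtop -b -d 5 -n 120 > /tmp/esxtop-batch.csv
# find which comma-separated column holds the "% Ready" counter for the Clio2 group
col=$(head -1 /tmp/esxtop-batch.csv | tr ',' '\n' | grep -n 'Clio2' | grep '% Ready' | cut -d: -f1 | head -1)
# print that column for every sample, one value per 5-second interval
awk -F',' -v c="$col" 'NR>1 {gsub(/"/, "", $c); print $c}' /tmp/esxtop-batch.csv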

Roger

RogerAli
Contributor

ChristianZ,

How would I know if my console is busy? I apologize for my lack of knowledge on this; I'm not a pro at Linux, I've just played with it a bit.

Is there a line I should look at in esxtop, or should I be running some other application to check this?
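
For instance, would watching the CCPU(%) line in esxtop, or just running the standard Linux tools inside the service console, be the right approach? Something like:

# inside the service console: interactive process/CPU view
top
# or a rolling CPU/memory summary every 5 seconds
vmstat 5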

Roger
