VMware Cloud Community
Reefcrazed
Contributor

Horrible disk activity with guest off

I hope someone can help, because this is getting out of hand. I ran ESXi 3.5 last year and had to get away from it for this very reason. Back then the host ran Debian 5.0 and one other small VM. A few times a day there would be really bad disk activity for maybe 5-10 minutes, enough that users would complain, and then it would go away. Back then it was a single Western Digital drive. The thing is, the disk activity meter in ESXi would not show any disk activity, as if the host rather than the guests were doing it.

Fast forward to today: I am giving ESXi another shot, this time with 4.0 on a different machine: 8 GB of DDR2, two VMs running, a total of 9 TB of usable space on a Dell PERC 5i controller, and an Intel quad-core processor. It had been running okay for a few days; my RAID finally finished syncing and there was no disk activity. Then tonight I noticed the drive lights were on steady, showing major activity. So I checked the performance chart: no disk activity for the guests, or so it says. But the hard drive lights, all of them, are going nuts. I checked the PERC card and all drives are synced and the hot spare is sitting idle, so all is well. Well, not really: users are complaining about disk latency again. I went ahead and shut ESXi down and rebooted, and the drive lights are still going crazy even though all of the charts in ESXi are almost flatlined.

To give you an idea, the only two VMs running are Debian 5 and Windows Server 2003, and neither is showing any major activity.

My question is this: does ESXi perform some type of garbage collection every so many hours?

8 Replies
J1mbo
Virtuoso

Hello. The physical activity may well be the PERC 5 running background surface scans, and nothing to do with ESX.

Is this machine using SATA disks? Does the PERC 5 have the battery-backed write cache module installed?

Please award points to any useful answer.

Reefcrazed
Contributor

Yes, they are SATA with TLER turned off.

I woke up this morning and they are no longer blinking heavily, just a flash every few minutes.

The question is, what is causing it and how do I fix it? That much activity makes the VMs useless, and I know for a fact it is not the VMs.

As for the PERC surface scan, I have never heard of that. Is there a way of turning it off?

Reefcrazed
Contributor

And to answer the other two questions: yes, write-back is enabled, and I do have a battery attached to the card.

I found nothing about the PERC 5i doing background scans. Did you mean background initialization? If so, that took around 4-5 days to complete and it is done now. Plus, while that was going on my VMs ran fine, a little slower than normal but nothing like this.

Reefcrazed
Contributor

I want to add one more thing. I started thinking about what could cause this, and maybe this is it.

Hours before this happened, maybe around 12pm, I had started a live guest-to-VM migration. It probably finished while I was not at home. The new VM had defaulted to 4 CPUs and 2 GB of RAM. It had not been powered on at that point. I think my other VMs have maybe 4 GB in total dedicated to them.

I powered on this 2008 Server VM at maybe around 7pm and noticed the problems around 10pm. With the host having 8 GB of RAM and the VMs being given around 6 GB, is that saturating the RAM? I mean, the other VMs are given that much RAM but are probably using less than 1 GB at any given time; they are idle. Is this not best practice?

J1mbo
Virtuoso

Sounds like you have answered your own question. vSwapping has a devastating impact on performance. Ensure all guests have VMware Tools installed (the balloon driver gives ESX a much softer way to recover RAM), do not overcommit RAM (leave 1.5 GB for ESX itself), and do not assign multiple vCPUs to guests except those that absolutely need them.
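
To put rough numbers on the overcommit advice, here is a minimal Python sketch of the arithmetic. The function name and the way the figures are grouped are illustrative assumptions; the 8 GB host, the 1.5 GB reserve, and the roughly 4 GB + 2 GB of VM allocations come from this thread. Per-VM virtualization overhead is not modelled.

```python
def memory_headroom_gb(host_gb, hypervisor_reserve_gb, vm_allocations_gb):
    """Return host RAM left after the hypervisor reserve and all VM allocations.

    A negative result means the configured VM memory exceeds what the host
    can back with physical RAM, so ballooning or swapping becomes likely.
    """
    usable_gb = host_gb - hypervisor_reserve_gb
    return usable_gb - sum(vm_allocations_gb)


# Figures from this thread: 8 GB host, 1.5 GB kept back for ESX,
# about 4 GB for the two existing VMs plus 2 GB for the new 2008 VM.
headroom = memory_headroom_gb(8.0, 1.5, [4.0, 2.0])
print(f"Headroom: {headroom:.1f} GB")  # prints "Headroom: 0.5 GB"
```

By this crude measure the configuration described in the thread has only about half a gigabyte spare, and that is before any per-VM overhead is counted, so there is very little margin.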

I'll dig out some references to the PERC 5 background stuff later. I think it is configurable, but unfortunately not with ESX installed.

The SATA array will perform much better as RAID-10 than RAID-5, as write performance is doubled (and you might be able to justify losing the hot spare in that case), and it will be much safer too. As you have found, initialising a large SATA RAID-5 array takes days, which leaves the system very exposed to a second disk failure.
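
The "doubled" figure follows from the commonly quoted write penalties: a small random write costs about four back-end disk I/Os on RAID-5 (read data, read parity, write data, write parity) and two on RAID-10 (one per mirror side). A rough Python sketch follows; the disk count and per-disk IOPS are assumptions for illustration only, and controller write cache and full-stripe writes are ignored.

```python
# Commonly quoted back-end disk I/Os per small random write.
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def random_write_iops(disks, iops_per_disk, level):
    """Estimate array-level random-write IOPS for the given RAID level."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


disks = 8           # assumption: number of member disks, not stated in the thread
iops_per_disk = 80  # assumption: ballpark figure for a 7200 rpm SATA disk

r5 = random_write_iops(disks, iops_per_disk, "raid5")
r10 = random_write_iops(disks, iops_per_disk, "raid10")
print(f"RAID-5: {r5:.0f} IOPS, RAID-10: {r10:.0f} IOPS, ratio {r10 / r5:.1f}x")
```

Whatever disk count and per-disk figure are plugged in, the ratio stays at 2x under this model, because only the write penalty differs between the two levels.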

Reefcrazed
Contributor

I read that somewhere else before....

Do not give each guest more than one CPU? Why is this? What is the impact?

J1mbo
Virtuoso

When ESX schedules time to a guest, it needs to provide as many cores as there are vCPUs assigned for that period of time. Unless the guest is actually at 100% on every CPU it sees, it will be running the OS idle thread on some of those cores. Hence the overall machine capacity is reduced, as those cores are not available for real work by other guests at that time.

Also, ESX itself needs some CPU resource, which is always bound to core 0. Say it is a quad-core box: assigning more than 2 vCPUs to a guest means that ESX cannot undertake work concurrently with that guest, let alone any other guest (work such as the vSwitch, disk I/O, preparing code for running, and so on).
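
The reasoning above can be turned into a toy calculation. This Python sketch models only the strict "one core per assigned vCPU" picture described in this post; it is not how the ESX scheduler is implemented, real schedulers are more nuanced, and both function names are made up for illustration.

```python
def idle_core_slots(vcpus_assigned, vcpus_busy):
    """Cores held by a guest that only run its idle thread, per time slice.

    Toy model: whenever the guest runs at all, it occupies one physical
    core for every vCPU assigned to it.
    """
    if not 0 <= vcpus_busy <= vcpus_assigned:
        raise ValueError("vcpus_busy must be between 0 and vcpus_assigned")
    return vcpus_assigned - vcpus_busy


def cores_left_for_others(physical_cores, vcpus_assigned):
    """Cores still free for ESX and other guests while this guest is scheduled."""
    return max(physical_cores - vcpus_assigned, 0)


# Quad-core host, guest with 4 vCPUs but only one busy thread.
print(idle_core_slots(4, 1))        # 3 cores spent on the guest's idle thread
print(cores_left_for_others(4, 4))  # 0 cores left for anything else
print(cores_left_for_others(4, 1))  # 3 cores left if the guest had 1 vCPU
```

Under these assumptions, a mostly idle 4-vCPU guest on a quad-core box ties up three cores doing nothing each time it is scheduled, which is the capacity loss the post describes.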

HTH

J1mbo
Virtuoso

Hi,

The info on the background surface scans for the PERC 5 can be found here.

Dell terms this process "Patrol Read", and unless configured otherwise it runs continually, with a 7-day break between each iteration.

HTH
