VMware Cloud Community
mimitte
Contributor

Has anybody had unexplained performance problems (CPU saturation) with an HDS AMS2500 storage array?

Hi,

This problem concerns our storage above all, but since the major part of the I/O activity comes from VMware, this message may find its place here.

For several months we have had a completely unexplained problem on our storage, which appeared as activity increased, even though the load remains rather low (400 VMs and fewer than 10,000 IOPS).

Regularly (several times a week), one or several CPUs of the AMS go to 100% busy for anywhere between a few minutes and 2 hours, but there is no increase in I/O activity to explain it.

When the CPUs are 100% busy, the LUN response times go up to 500 ms, 800 ms, or even 1.5 seconds (very, very bad!). During these saturations, all the other parameters of the storage array are OK: write pending stays at 10%, physical disks are under 40% busy, and no array group or LUN shows any overactivity; the only problem is the CPU!!! (We use HDS "Tuning Manager" to measure all these parameters.)
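
(If you want to cross-check these response times from the ESX side rather than from Tuning Manager, esxtop shows the same latencies; this is just the standard procedure, nothing specific to our setup:

    # on the ESX host (or resxtop from the vSphere CLI)
    esxtop
    # press 'u' for the disk device view:
    #   DAVG/cmd = latency reported by the device (the array)
    #   KAVG/cmd = time spent in the VMkernel
    #   GAVG/cmd = total latency as seen by the guest

)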

We have opened a case with HDS: support is working on our problem, together with another team of performance experts; after 2.5 months, neither team has found anything about the cause of the CPU saturation.

HDS asked us to remove all the SATA disks and replace them with SAS, which has been done, but nothing changed.

Then HDS asked us to reduce the queue depth of the HBAs on the ESX servers, which has been done, but nothing changed.
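
(For those wondering how this is done: on ESX the HBA queue depth is set through the driver module options; the exact module name and parameter depend on the HBA vendor and driver version. With a QLogic HBA, for example, something like:

    # set the per-LUN queue depth to 16 for the QLogic driver (reboot required)
    esxcfg-module -s ql2xmaxqdepth=16 qla2xxx

An Emulex HBA uses the lpfc driver with its own lun_queue_depth parameter instead.)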

All parameters were checked, re-checked, and re-re-checked by several people from HDS and from VMware: everything seems OK.

So far, it is really a mystery!!

Has anyone encountered a similar problem with AMS2000-series storage from HDS?

remi

philoub
Contributor

Hi,

We had a similar problem with our AMS. It was due to the SSL connection of the management tools. We deactivated SSL and everything went fine.

BR

Philippe

Aramice
Contributor

Hi,

This has already been verified: SSL is not activated.

The problem still exists, and it comes from synchronous replication (TrueCopy): when we suspend the pairs, everything runs fine; but neither we nor HDS know why...

remi

philoub
Contributor

Hi,

Do you use HDS software (Device Manager, Replication Manager)?

Do you use CCI? What is the format of your horcm configuration file?
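
(By "format" I mean what your HORCM sections look like; a minimal instance file is shaped roughly like this, with example names only:

    HORCM_MON
    #ip_address    service    poll(10ms)    timeout(10ms)
    127.0.0.1      horcm0     1000          3000

    HORCM_CMD
    #dev_name (the command device)
    /dev/sdk

    HORCM_DEV
    #dev_group    dev_name    port#    TargetID    LU#    MU#
    VG01          pair01      CL1-A    1           5      0

    HORCM_INST
    #dev_group    ip_address     service
    VG01          remote-host    horcm1

)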

Bye

depping
Leadership

I've seen something similar a while back; it was related to a fabric switch whose buffer got flooded at some point.

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

Aramice
Contributor

Hi,

I don't use HDS software: no Device Manager, no Replication Manager; only SNM2.

I don't use CCI to manage pairs: only SNM2.

remi

Aramice
Contributor

Hi Duncan,

What you describe could be of interest to us: the SAN side of the question has not been studied yet.

Our Fabric OS version is quite old (5.3.1a). Do you remember anything more?

thanks

depping
Leadership

No, unfortunately I cannot recall exactly what the issue was, but it was caused by a fabric switch and not by the array itself.

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

benruset
Contributor

Mimitte,

Did you find a solution to your issue? We're seeing the same thing in our environment.

Aramice
Contributor

Hi benruset,

Does this problem occur on AMS storage? And are you using TrueCopy?

If so, do you have an idea of the number of write IOPS on your storage?

We have about 3,000-4,000 write IOPS with TrueCopy replication: this is too heavy a load for the AMS storage (due solely to TrueCopy).
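
As a rough illustration (our own back-of-the-envelope reasoning, not a figure HDS gave us): with synchronous TrueCopy every host write must also be copied to the remote array and acknowledged before it completes, so

    4,000 host write IOPS x 2 (local write + remote copy) = at least 8,000 controller write operations per second

before counting any replication handshaking, all handled by the same controller CPUs.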

After several months and many "action plans" (migrating all SATA disks to SAS, changing the queue depth on VMware, ...) requested by HDS support, we received an official letter from HDS explaining that write I/O with TrueCopy generates too heavy a load on the CPUs of our AMS, and that the only things we can do are to do less TrueCopy, or to move our storage from midrange (AMS) to high-end (VSP).

I hope this will help you (when I had this problem and wrote my post, nobody told me anything like that!!)

remi from Paris

TheEsp
Enthusiast

Hi mimitte

Very strange that HDS has not picked this up; they are normally very good at finding performance issues on an array.

A. Do you have Tuning Manager? If not, get it installed.

B. Create a RAID group busy report and narrow it down to the particular RAID group/LDEV.

C. What are your RAID group configs? Are you using HDP or standard RAID groups?

What type of apps are you running in the VMware cluster? If you are running high-write I/O workloads, then SATA on the HDS platforms is going to cop an additional check for every write to the array.

David

Aramice
Contributor

Hi David,

Yes, we have been using Tuning Manager for several years.

And we can see that physical devices and RAID groups (we don't use HDP on our AMS) never go above 50% busy.

We no longer have any SATA disks in our array.

The problem does not come from the backend, only from the CPU load caused by TrueCopy.

remi

dan_pu
Contributor

Hi, Remi. I work with benruset.

We discovered a different source for our particular problem, where only one CPU was doing most of the work and latency would greatly increase whenever that CPU was pegged at 100%.

Ultimately it was related to TrueCopy, but from LUN ownership rather than IOPS load. We use TrueCopy Extended (TCE) with 5-minute intervals. A number of our first TCE pairs were (mis)configured to use a single data pool. We have four data pools, and all the pairs should be spread across all of them.

Since some pairs were not spread across the pools, the LUN ownership of those LUNs would stick with that one pool, which in turn binds to one CPU core. Normally the AMS load-balances LUN ownership across all four cores, but TCE'd LUNs do not load-balance: they stay with the LUN ownership of their data pool. When we manually moved the LUN ownership of the misconfigured TCE LUNs, the 100% single-core CPU behavior would follow it.
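
Schematically, what we had looked like this (an illustrative mapping, not our exact LUN layout):

    data pool 0 -> core 0 : all the misconfigured TCE pairs -> core 0 pegged at 100%
    data pool 1 -> core 1 : almost idle
    data pool 2 -> core 2 : almost idle
    data pool 3 -> core 3 : almost idle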

To remedy our situation, we deleted the TCE pairs and recreated them with the proper data pool assignments. This immediately balanced our CPU load.

We also fixed the buffer-to-buffer (B2B) credit of the switch F-ports attached to our SAN. They had a setting of 4 where it is supposed to be 40. A portdisable/portenable fixed that, but it didn't help our situation.
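
In case it helps anyone: the portdisable/portenable above are Brocade commands, so on our switches checking and bouncing the port went roughly like this (the port number is an example; exact syntax varies with the Fabric OS version):

    portbuffershow    # shows the buffer credits allocated per port
    portdisable 12
    portenable 12     # forces F-port 12 to re-negotiate its credits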

Dan
