We have two datacenters, A and B (vCenter Servers in Linked Mode).
Servers in sites A and B are being replicated over the SAN (by the storage array, not SRM).
Notes: The known issue in the environment is SAN-related. Every day there are datastore disconnection events from storage (the issue is 4 months old now), and there are performance issues on Windows servers.
To fix the issue, HP advised upgrading the controller firmware on both sites (3 HP EVA controllers). This has been completed successfully, but the issue has not been resolved.
There are many outages caused by the storage issue in the environment. As an example, the server below was unresponsive because its datastore was disconnected from storage.
VMkernel logs indicate it is failing on the physical path.
Note: The storage array is active/active and the path policy for the ESX hosts is Round Robin.
What steps can be taken to remediate the issue?
Thank you,
Deepak Negi
VCP4, VCP5
What about enc out errors on the SAN switch ports connected to the EVA? Do you have any CRC errors and, if so, is the counter increasing? Reset the counter if necessary so that you can monitor it.
What about EVA logs? Any excessive number of link and enclosure check condition errors?
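To make the "did the counter increase" check concrete, here is a minimal sketch that compares two snapshots of per-port CRC counters. The port names and counts are made up for illustration; how you collect the counters from your switches is up to you and is not shown here.

```python
# Hypothetical snapshots: switch port name -> CRC error count.
# In practice these would be copied from the switch's port error output
# at two points in time (e.g. right after a reset and a day later).

def crc_increases(before, after):
    """Return the ports whose CRC counter grew between two snapshots."""
    growth = {}
    for port, new_count in after.items():
        delta = new_count - before.get(port, 0)
        if delta > 0:
            growth[port] = delta
    return growth

baseline = {"port0": 0, "port1": 12, "port4": 3}
later = {"port0": 0, "port1": 57, "port4": 3}

# Only ports with a growing counter are worth chasing (cable, SFP, HBA).
print(crc_increases(baseline, later))
```

A static non-zero count is usually old history; a port whose count keeps growing is the one to look at.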
Based on my experience, active/active storage like EVA should use "fixed" policy for multipath.
Thanks for your quick response.
The issue has been occurring for the last 4 months. Do you mean changing the multipathing policy to Fixed? The storage array is active/active; would that make any difference?
Hey, how does the replication happen? Is it synchronous or asynchronous? Synchronous means the VMs/ESX hosts would have to wait a little longer and are perhaps timing out.
How is the storage handling reservation conflicts? Is the storage CPU peaking above 90% all the time?
Also, one can see how busy the disks are at the SAN array.
HTH.
~Sai Garimella
If this is what has been advised by your storage vendor, then just stick with it. Plus, I believe some EVA series are now ALUA compliant and should work fine with the round robin policy.
BTW, any error counters (e.g. CRC errors) on the SAN switches? Does your SAN switch firmware match the controller firmware?
The SCSI code in the error corresponds to a QFULL condition. Check the esxtop output for the device for any active queues, and also monitor the performance of the storage array. You might also try increasing the queue depth value to see if it makes any difference. If I remember correctly the default is 32, so 64 could be a good test. Have a look at the article below for further reference:
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1008113
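As a rough back-of-the-envelope check before changing the queue depth, Little's law says the average number of outstanding I/Os is roughly IOPS multiplied by latency. The sketch below compares that estimate against a queue depth of 32 and 64. The IOPS and latency figures are invented for illustration, and the array's own port queue limits (which are what actually return QFULL) are not modelled.

```python
# Little's law: average outstanding I/Os ~= arrival rate * time in system.
# The workload numbers below are assumptions, not measurements from this thread.

def outstanding_ios(iops, latency_ms):
    """Estimate the average number of I/Os in flight for a device."""
    return iops * latency_ms / 1000.0

def queue_headroom(iops, latency_ms, queue_depth):
    """Positive: the queue has room on average. Negative: I/Os will back up."""
    return queue_depth - outstanding_ios(iops, latency_ms)

for depth in (32, 64):
    headroom = queue_headroom(iops=4000, latency_ms=10, queue_depth=depth)
    print(f"queue depth {depth}: headroom {headroom:+.1f} I/Os")
```

Plug in the per-device numbers you see in esxtop during the problem window. If the estimate already sits well below 32, a deeper host queue is unlikely to help and the bottleneck is more likely on the array side.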
The SAN switch firmware matches the controller firmware. As per the firmware upgrade prerequisites, we upgraded the following in our environment
HP EVA is ALUA compliant and HP has recommended to use round robin multipath.
The firmware upgrade on the controllers increased the number of issues.
Replication is asynchronous. CPU usage is sometimes more than 90%. I have noticed that the datastore disconnections happen after the backup starts. Note that we do not have LAN-free backup.
No, we don't have CRC or enclosure check condition errors.
Excellent, then there's your clue. How exactly do you back up the data?
You mentioned there's replication. Is it possible to back up the data from the replicated LUNs, as they do not serve live data?
It's advisable to collect the disk utilization graphs and port queue lengths during the backup window.
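To confirm the backup correlation with numbers rather than impressions, something like the sketch below can split the disconnection events into those inside and outside the backup window. The timestamps and the 22:00 to 04:00 window are placeholders; substitute the event times from your vCenter/VMkernel logs and your real backup schedule.

```python
from datetime import datetime, time

def in_window(ts, start, end):
    """True if the time of day of ts falls within [start, end)."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    # Window crosses midnight, e.g. 22:00 to 04:00.
    return t >= start or t < end

def split_by_window(events, start, end):
    """Return (events inside the window, events outside the window)."""
    inside = [e for e in events if in_window(e, start, end)]
    outside = [e for e in events if not in_window(e, start, end)]
    return inside, outside

# Placeholder disconnection timestamps, not taken from this environment.
events = [
    datetime(2013, 1, 7, 22, 15),
    datetime(2013, 1, 8, 1, 40),
    datetime(2013, 1, 8, 14, 5),
]
inside, outside = split_by_window(events, time(22, 0), time(4, 0))
print(len(inside), "of", len(events), "events fall inside the backup window")
```

If nearly all events land inside the window, the disk utilization and port queue graphs for that same window are the next thing to pull.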
HTH,
~Sai Garimella
Not all the LUNs are being replicated. Secondly, the replicated Vdisks are not being accessed by any server; they are only for DR purposes.
Backing up the replicated Vdisks is a good idea, although the non-replicated VMs will still need to be backed up over the network. It may reduce a lot of load.
Any other possible options? Controller CPU usage figure.
I can think of:
1. Add more CPUs.
2. Look out for in-house scripts that do things like cat on datastore files; these cause unnecessary reservation/release.