VMware Cloud Community
DeepakNegi420
Contributor

Datastore disconnection from storage

We have two datacenters, A and B, with vCenter Server in linked mode.

Servers in sites A and B are replicated over the SAN (by the storage array, not SRM).

Known issue in the environment: SAN problems. Every day there are datastore disconnection events from storage (the issue is now 4 months old), along with performance issues on Windows servers.

To fix the issue, HP advised upgrading the controller firmware at both sites (3 HP EVA controllers). The upgrade completed successfully, but the issue has not been resolved.

The storage issue has caused many outages in the environment. As an example, the server below was unresponsive because its datastore was disconnected from storage.

storage event.jpg, CSC server.jpg

VMkernel logs indicate it is failing on the physical path.

vmkernal logs2.jpg

Note: the storage array is active/active and the path policy for the ESX hosts is round robin.

What steps can be taken to remediate the issue?

Thank you,

Deepak Negi

VCP4, VCP5

Regards, Deepak Negi
1 Solution

Accepted Solutions
athlon_crazy
Virtuoso

What about enc out on the SAN switch ports connected to the EVA? Do you have any CRC errors, and if so, are the counters increasing? Reset the counters if necessary so that you can monitor them.

What about the EVA logs? Is there an excessive number of link or enclosure check condition errors?

http://www.no-x.org

12 Replies
athlon_crazy
Virtuoso

Based on my experience, active/active storage like the EVA should use the "Fixed" multipathing policy.

http://www.no-x.org
DeepakNegi420
Contributor

Thanks for your quick response.

The issue has been occurring for the last 4 months. Do you mean changing the multipathing policy to Fixed? The storage array is active/active; would that make any difference?

Regards, Deepak Negi
SG1234
Enthusiast

Hey, how does the replication happen? Is it synchronous or asynchronous? With synchronous replication the VMs/ESX hosts would have to wait a little longer, and perhaps they are timing out?

How is the storage handling reservation conflicts? Is the storage CPU peaking above 90% all the time?

You can also check how busy the disks are on the SAN array.

HTH.

~Sai Garimella

athlon_crazy
Virtuoso

If this is what your storage vendor has advised, then just stick with it. Besides, I believe some EVA series are now ALUA compliant and should work fine with the round robin policy.

BTW, are there any error counters (e.g. CRC errors) on the SAN switches? Does your SAN switch firmware match the controller firmware?

http://www.no-x.org
vGuy
Expert

The SCSI code in the error corresponds to a QFULL condition. Check the esxtop output for the device for any active queues, and also monitor the performance of the storage array. You might also try increasing the queue depth value to see if it makes any difference. If I remember correctly the default is 32, so 64 would be a good test. Have a look at the article below for further reference:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1008113
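
One way to reason about whether a larger queue depth could help is Little's law: the average number of I/Os in flight is roughly IOPS multiplied by latency. The sketch below is illustrative only. The IOPS and latency figures are made-up assumptions, not measurements from this environment, and the function names are hypothetical.

```python
def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Little's law: average I/Os in flight = arrival rate x time in system."""
    return iops * latency_ms / 1000.0


def queue_headroom(iops: float, latency_ms: float, queue_depth: int) -> float:
    """Positive result: free queue slots. Negative: demand exceeds the queue depth."""
    return queue_depth - outstanding_ios(iops, latency_ms)


if __name__ == "__main__":
    # Hypothetical workload: 4000 IOPS at 12 ms average latency.
    for depth in (32, 64):
        headroom = queue_headroom(iops=4000, latency_ms=12, queue_depth=depth)
        print(f"queue depth {depth}: headroom {headroom:+.1f} I/Os")
```

If the headroom is negative at 32 and positive at 64, a deeper queue may absorb the burst on the host side, but it also pushes more concurrent I/O at an array that is already reporting QFULL, so it is worth testing rather than assuming.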

DeepakNegi420
Contributor

The SAN switch firmware matches the controller firmware. As per the firmware upgrade prerequisites, we upgraded the following in our environment:

  • HBA drivers and firmware on Windows/Linux/ESX hosts
  • SAN switch firmware
  • Disk block size from 32MB to 128KB

The HP EVA is ALUA compliant, and HP has recommended round robin multipathing.

The firmware upgrade on the controllers increased the number of issues.

Regards, Deepak Negi
athlon_crazy
Virtuoso

What about enc out on the SAN switch ports connected to the EVA? Do you have any CRC errors, and if so, are the counters increasing? Reset the counters if necessary so that you can monitor them.

What about the EVA logs? Is there an excessive number of link or enclosure check condition errors?
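
The "did the counter increase" check can be sketched as a comparison of two counter snapshots taken some time apart. This is a minimal illustration, assuming the counters have already been copied by hand from the switch into dictionaries; the port names, counter names and values are hypothetical, and no switch API is used.

```python
def increased_counters(before: dict, after: dict) -> dict:
    """Return {port: {counter: delta}} for every counter that grew between snapshots."""
    changes = {}
    for port, counters in after.items():
        baseline = before.get(port, {})
        deltas = {
            name: value - baseline.get(name, 0)
            for name, value in counters.items()
            if value > baseline.get(name, 0)
        }
        if deltas:
            changes[port] = deltas
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots of per-port error counters.
    before = {
        "port 4": {"crc_err": 0, "enc_out": 12},
        "port 5": {"crc_err": 0, "enc_out": 0},
    }
    after = {
        "port 4": {"crc_err": 3, "enc_out": 12},
        "port 5": {"crc_err": 0, "enc_out": 0},
    }
    print(increased_counters(before, after))
```

A counter that is non-zero but static usually points at an old event; one that keeps climbing between snapshots points at a cable, SFP or port that is still misbehaving.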

http://www.no-x.org
DeepakNegi420
Contributor

Replication is asynchronous. CPU usage is sometimes above 90%. I have noticed that the datastore disconnections happen after the backup starts. Note that we do not have LAN-free backup.

Storage_CPU1.JPG
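
The observation that disconnections follow the backup start can be checked against the event log rather than by eye. The sketch below assumes the disconnection timestamps have been exported into a list; the timestamps and the 22:00 to 04:00 backup window are hypothetical placeholders, not values from this environment.

```python
from datetime import datetime, time


def in_window(ts: datetime, start: time, end: time) -> bool:
    """True if ts falls inside a daily window; handles windows that cross midnight."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def share_in_window(events: list, start: time, end: time) -> float:
    """Fraction of events that fall inside the daily window (0.0 if no events)."""
    if not events:
        return 0.0
    hits = sum(1 for ts in events if in_window(ts, start, end))
    return hits / len(events)


if __name__ == "__main__":
    # Hypothetical disconnection timestamps.
    events = [
        datetime(2013, 1, 7, 22, 15),
        datetime(2013, 1, 8, 1, 40),
        datetime(2013, 1, 8, 14, 5),
        datetime(2013, 1, 9, 23, 30),
    ]
    share = share_in_window(events, start=time(22, 0), end=time(4, 0))
    print(f"{share:.0%} of disconnections fall inside the backup window")
```

If most events land inside the window over several weeks of data, the backup load is a strong suspect; if they are spread evenly, the correlation is probably coincidental.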

Regards, Deepak Negi
DeepakNegi420
Contributor

No, we don't have any CRC or enclosure check condition errors.

Regards, Deepak Negi
SG1234
Enthusiast

Excellent, then there's your clue. How exactly do you back up the data?

You mentioned there's replication. Is it possible to back up the data from the replicated LUNs, as they do not serve live data?

It's advisable to collect the disk utilization graphs and port queue lengths during the backup window.
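
Once utilization samples have been collected, comparing them inside and outside the backup window makes the effect of the backup easy to see. This is a rough sketch; the sample values, the hour-based window and the function name are assumptions for illustration.

```python
from statistics import mean


def summarize(samples: list, window_hours: set) -> dict:
    """Split (hour_of_day, utilization_percent) samples by backup window and summarize."""
    inside = [util for hour, util in samples if hour in window_hours]
    outside = [util for hour, util in samples if hour not in window_hours]

    def stats(values):
        # None when there are no samples on that side of the window.
        return {"mean": mean(values), "peak": max(values)} if values else None

    return {"inside": stats(inside), "outside": stats(outside)}


if __name__ == "__main__":
    # Hypothetical hourly disk utilization samples (hour of day, percent busy).
    samples = [(20, 40), (21, 45), (22, 90), (23, 96), (0, 93), (5, 35)]
    backup_hours = {22, 23, 0, 1, 2, 3}
    print(summarize(samples, backup_hours))
```

The same split can be applied to port queue lengths or controller CPU readings by feeding those samples in instead.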

HTH,

~Sai Garimella

DeepakNegi420
Contributor

Not all the LUNs are being replicated. Also, the replicated Vdisks are not accessed by any server; they are only for DR purposes.

Backing up the replicated Vdisks is a good idea; however, the other non-replicated VMs will still need to be backed up over the network. It may reduce a lot of load.

Any other possible options? Controller CPU usage figures:

Storage_EVAC1.JPG, Storage_EVAK1.JPG, Storage_EVAK2.JPG

Regards, Deepak Negi
SG1234
Enthusiast

I can think of:

1. Add more CPUs.

2. Look out for in-house scripts that do things like cat on datastore files; this causes unnecessary reservation/release.
