We recently switched to an EqualLogic 6100X array at our co-location. We use Veeam to replicate a few servers to the EqualLogic throughout the day. Since moving to this new array we have been receiving errors from vCenter, "Lost path redundancy to storage...", and it only ever reports one of the EqualLogic volumes, usually an RDM that is mounted to our Veeam proxy server. We are using MEM 1.1.2. We often see well over 50 ms of latency while a snapshot is being removed from one of our replicas, and that is usually when we receive the error from vCenter.
Anyone experience a similar issue or have any ideas as to how to fix it?
What version of ESXi are you running?
It sounds like your storage system IS the problem.
Eventually, ESXi will consider a path suspect if it does not get the response time it expects, and different versions of ESXi handle this differently. I have found that ESXi 4.1 does not handle these issues as gracefully as 5.x. I have hosts of both versions connected to the same datastores, and while my 4.x hosts tend to log "lost access to volume due to connectivity issues" during very high latency, the 5.x hosts lose the connection much less frequently but log a bunch of latency warnings.
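One quick way to tell which of those two situations you are in is to count the latency-deterioration warnings versus the outright path/connectivity messages in vmkernel.log. A minimal sketch, run here against a made-up sample file rather than a live host's /var/log/vmkernel.log (the device IDs and timestamps below are invented for illustration):

```shell
# Hypothetical vmkernel.log excerpt -- sample lines only, not from a real host
cat > /tmp/vmkernel.sample.log <<'EOF'
2013-01-15T09:12:01Z cpu4:4100)ScsiDeviceIO: Device naa.6090a0aaaa performance has deteriorated. I/O latency increased from average value of 4521 microseconds to 91230 microseconds.
2013-01-15T09:12:44Z cpu2:4096)NMP: nmp_ThrottleLogForDevice: path vmhba33:C0:T3:L0 is down
2013-01-15T09:13:02Z cpu4:4100)ScsiDeviceIO: Device naa.6090a0bbbb performance has deteriorated. I/O latency increased from average value of 3900 microseconds to 60110 microseconds.
EOF

# Latency warnings (5.x behavior: path stays up, latency is logged)
grep -c 'performance has deteriorated' /tmp/vmkernel.sample.log   # → 2

# Actual path-down events (the more serious case)
grep -c 'is down' /tmp/vmkernel.sample.log                        # → 1
```

If you see mostly latency warnings clustered around snapshot removal times, that points at the array struggling under the load rather than a cabling or switch problem.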
Take a closer look at your storage system while the snapshots are being removed and look for CPU and/or cache saturation, or disks that are maxed out and can't handle the load being thrown at them.
I'm running 5.0 Update 1. All hosts seem to report the issue within a few minutes of each other. The weird thing is that the volume they report losing redundancy to isn't even being used. The only thing I can really see is high latency...
I can give you an example of what I have seen in the past. I had a NetApp filer that was being shared for SAN and NAS purposes. During backups of our NAS shares, and sometimes during snapshots, the I/O on the disks being accessed went beyond what they could handle and caused the cache to be constantly flushed back to disk. Even though the VM datastores, which were on faster disks, were not being heavily accessed, the overall cache degradation on the storage system caused everything in VMware to suffer greatly.
You should probably get in touch with Dell if you haven't already and have them run some diagnostics on their end to see if there is anything you can change to make this operation behave better.
Thanks for the info. I opened a case with Dell this morning. I just noticed some strange behavior: I can see each drive's load in SAN HQ, and Drive 0 is being hit very hard (maxed IOPS and high read I/O) while the other 21 drives are hardly touched at all.
I'll keep this thread updated with my findings.