VMware Cloud Community
KrishnaR
Enthusiast

DS4000 SRM issues

I'm starting this thread to hear back from users and the field about any experiences with DS4000 and SRM. I'm particularly interested in any issues or problems encountered. I've been working with SRM and DS4000 since beta and can try to help resolve any problems that have come up. I'm also working on an SRM guide but can't give a date for it yet.

106 Replies
FG0711
Contributor

Attached. What should I be looking for in the log files? Thanks.

dex_1234
Contributor

I was just trying to understand where you were in your testing. As for the logs, I'm just trying to correlate the SRM logs with activity on the storage system; I'm mainly looking at the MEL log in your bundle.

FG0711
Contributor

Dex, I have the LUNs synchronized now. I will try to follow your document for a test failover.

dex_1234
Contributor

The problems you're seeing during the protection group configuration and with datastore access may have been caused by a logical drive failover event that occurred around 8:31 AM. The MEL log from "SAN2" shows a mode select event (seq #1460, event #300D), which indicates that the ESX failover driver issued a LUN failover request to controller A. Shortly thereafter we get the infamous "logical drive not on preferred path" entry. I dug a little deeper and it looks like the request came specifically from HBA port 2100001b32049edf on ESX2. For some reason the alternate HBA port (2100001b3204a0df) lost communication with the storage subsystem at this point in time.

And if you look at the corresponding time frame in your vmware-dr log, this appears to be when SRM starts having issues:

2009-07-09 08:30:16.531 'RemoteSite 'Protected Site'' 3596 info Raising RemoteSitePingFailed event

2009-07-09 08:30:16.531 'RemoteSite 'Protected Site'' 3416 info Attempting to reconnect

2009-07-09 08:30:16.531 'RemoteSite 'Protected Site'' 3416 info Attempting to connect to remote site

2009-07-09 08:30:16.578 'LicenseManager' 3580 info FlexLM: The feature 'SRM_PROTECTED_HOST' is not present on the license server or has expired.

2009-07-09 08:30:17.515 'RemoteSite 'Protected Site'' 3416 warning Failed to connect to remote DR: Unexpected Vmacore::SystemException No connection could be made because the target machine actively refused it. (10061)

2009-07-09 08:31:16.484 'LocalSiteStatus' 3416 verbose Free disk space: 129438 Mb

2009-07-09 08:31:16.484 'LocalSiteStatus' 3416 verbose CPU usage: 1 %

2009-07-09 08:31:16.484 'LocalSiteStatus' 3416 verbose Available memory: 3462 Mb

2009-07-09 08:31:16.531 'RemoteSite 'Protected Site'' 3416 warning Failed to ping remote site

2009-07-09 08:31:17.515 'RemoteSite 'Protected Site'' 3580 info Attempting to reconnect

2009-07-09 08:31:17.515 'RemoteSite 'Protected Site'' 3580 info Attempting to connect to remote site

2009-07-09 08:31:18.593 'RemoteSite 'Protected Site'' 3580 warning Failed to connect to remote DR: Unexpected Vmacore::SystemException No connection could be made because the target machine actively refused it. (10061)

2009-07-09 08:32:16.484 'LocalSiteStatus' 3596 verbose Free disk space: 129438 Mb

2009-07-09 08:32:16.484 'LocalSiteStatus' 3596 verbose CPU usage: 0 %

2009-07-09 08:32:16.484 'LocalSiteStatus' 3596 verbose Available memory: 3461 Mb

2009-07-09 08:32:16.531 'RemoteSite 'Protected Site'' 3596 warning Failed to ping remote site

2009-07-09 08:32:18.593 'RemoteSite 'Protected Site'' 3576 info Attempting to reconnect

2009-07-09 08:32:18.593 'RemoteSite 'Protected Site'' 3576 info Attempting to connect to remote site

2009-07-09 08:32:19.515 'RemoteSite 'Protected Site'' 3576 warning Failed to connect to remote DR: Unexpected Vmacore::SystemException No connection could be made because the target machine actively refused it. (10061)

2009-07-09 08:33:16.484 'LocalSiteStatus' 3576 verbose Free disk space: 129438 Mb

2009-07-09 08:33:16.484 'LocalSiteStatus' 3576 verbose CPU usage: 0 %

2009-07-09 08:33:16.484 'LocalSiteStatus' 3576 verbose Available memory: 3457 Mb

2009-07-09 08:33:16.531 'RemoteSite 'Protected Site'' 3576 info Raising RemoteSiteDown event

2009-07-09 08:33:19.515 'RemoteSite 'Protected Site'' 3416 info Attempting to reconnect

2009-07-09 08:33:19.515 'RemoteSite 'Protected Site'' 3416 info Attempting to connect to remote site

2009-07-09 08:33:25.140 'RemoteSite 'Protected Site'' 3416 info VC Connection: Logging in as user 'administrator'

2009-07-09 08:33:25.203 'RemoteSite 'Protected Site'' 3416 info VC Connection: Logged in session 6343D5BF-3917-469F-ABBA-3ACDC4A290AB

2009-07-09 08:33:26.140 'RemoteSite 'Protected Site'' 3416 warning Failed to connect to remote DR: Unexpected exception 'class Vmacore::Http::HttpException' HTTP error response: Service Unavailable

2009-07-09 08:33:26.140 'RemoteSite 'Protected Site'' 3416 verbose VC Connection: Logging out session 6343D5BF-3917-469F-ABBA-3ACDC4A290AB

2009-07-09 08:33:26.203 'RemoteSite 'Protected Site'' 3416 verbose VC Connection: Logged out session 6343D5BF-3917-469F-ABBA-3ACDC4A290AB

2009-07-09 08:34:16.484 'LocalSiteStatus' 3596 verbose Free disk space: 129435 Mb

2009-07-09 08:34:16.484 'LocalSiteStatus' 3596 verbose CPU usage: 3 %

2009-07-09 08:34:16.484 'LocalSiteStatus' 3596 verbose Available memory: 3416 Mb

2009-07-09 08:34:16.531 'RemoteSite 'Protected Site'' 3596 warning Failed to ping remote site

2009-07-09 08:34:16.578 'LicenseManager' 3596 verbose FlexLM: Server Available.

2009-07-09 08:34:16.578 'LicenseManager' 3596 info FlexLM: The feature 'SRM_PROTECTED_HOST' is not present on the license server or has expired.

2009-07-09 08:34:16.640 'HostLicenseMonitor' 3596 verbose Checking Host Licenses

I'm not sure what caused the initial failover, but make sure you're following the multipathing guidelines laid out by VMware for DS storage and, if you haven't done so already, redistribute the logical drives back to their original controllers via Storage Manager.
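If it helps with the correlation described above, here is a minimal Python sketch that pulls the warning/error entries from a vmware-dr log that fall within a few minutes of a storage-side event such as the 8:31 AM failover. The field layout (timestamp, quoted component, thread id, level, message) is only inferred from the excerpt in this post, and the names parse_line and entries_near are made up for illustration; they are not part of SRM or any IBM tool.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the excerpt above (an assumption, not a documented format):
# <timestamp> '<component>' <thread id> <level> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"'(?P<component>.+?)' (?P<thread>\d+) (?P<level>\w+) (?P<message>.*)$"
)

def parse_line(line):
    """Return a dict for one log line, or None if it does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "component": m.group("component"),
        "thread": int(m.group("thread")),
        "level": m.group("level"),
        "message": m.group("message"),
    }

def entries_near(lines, event_time, window_minutes=2, levels=("warning", "error")):
    """Keep entries of the given levels within +/- window_minutes of event_time."""
    delta = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["level"] not in levels:
            continue
        if abs(entry["time"] - event_time) <= delta:
            hits.append(entry)
    return hits

# Four lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "2009-07-09 08:30:16.531 'RemoteSite 'Protected Site'' 3596 info Raising RemoteSitePingFailed event",
    "2009-07-09 08:30:17.515 'RemoteSite 'Protected Site'' 3416 warning Failed to connect to remote DR: "
    "Unexpected Vmacore::SystemException No connection could be made because the target machine actively refused it. (10061)",
    "2009-07-09 08:31:16.531 'RemoteSite 'Protected Site'' 3416 warning Failed to ping remote site",
    "2009-07-09 08:33:16.531 'RemoteSite 'Protected Site'' 3576 info Raising RemoteSiteDown event",
]

# Time of the controller failover as read from the MEL log (entered by hand).
mel_event = datetime(2009, 7, 9, 8, 31)
hits = entries_near(SAMPLE, mel_event, window_minutes=1)
for entry in hits:
    print(entry["time"], entry["level"], entry["message"])
```

To run it against a real log, replace SAMPLE with the lines of the file, for example open(path).read().splitlines(), and set mel_event to the time shown in the MEL log. Lines that do not match the assumed layout are simply skipped.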

FG0711
Contributor

Dex_1234,

Thanks again for your detailed post. Even though I now have SRM working, I am looking into my fabric connections to see why both of my SANs sometimes switch from one controller to the other. We were forced to implement a redundant fabric using two Cisco MDS9124 switches on each side, but then merged them into a single fabric at the Protected side to accommodate the DS4700-70 situation (one available port per controller with replication on). This was suggested in an earlier post here.

Other small issues I would like to understand are:

1. It looks like SRM adds a UUID in front of the datastore name assigned for recovery: srm_datastore -> uuid_srm_datastore.

2. It looks like SRM initiates a reverse replication once the failover is complete and the new protected site is up.

I would like to keep my name assignments and not have replication start automatically in the opposite direction.

I would appreciate any feedback. Thanks.

FG0711
Contributor

Dex,

After studying your recent post, I tried the following:

1. Separated Controller A, port 2 of the protected storage and Controller A, port 2 of the recovery storage into one zone, and Controller B, port 2 of the protected storage and Controller B, port 2 of the recovery storage into a different zone. Both zones are on the same VSAN.

2. The above left me unable to set up a mirrored pair. I get an error message saying it cannot establish communication with the other side of the mirrored pair.

Therefore, I had to combine all four FC ports in (1) into the same zone.

I understand that this may have caused the controller failover I've seen, but the system still works OK.

I tested failover several times and everything looks good. Thanks again for your very useful comments.

mrenna
Contributor

Hi there!

I'm reviving this thread because, reading through it, I've found a lot of interesting tips (and troubleshooting!).

Now, hoping that someone is still following this thread, I'm posting a log from a Recovery site on which I continue to get test failures...

Specifically, I have 2 sites (ESX 3.5 clusters) with SRM 1 Update 5 and the latest version of the IBM SRA... on both sites, with 2 DS4700 (model 72).

The test fails with "Error: Failed to recover datastore: " ... I've already followed the KB about editing config.xml.

(The log relates to some tests on a protection group with only one VM protected, but I get the same results with all of the VMs.)

Thank you in advance for any comments!

mrenna
