I'm having trouble with a vSphere cluster installation.
The infrastructure consists of BL460c G7 servers running ESX 4.1 U1.
The storage is two DataCore SANsymphony 7 PSP4 servers (SSY 7 PSP4), with one Hitachi box per DataCore server.
The two storage servers are located in two separate datacenters, linked by two 10Gb connections (LAN config).
The fabrics are IBM McData 140M switches.
Each ESX host has two FC cards, one connected to each McData switch; the switches are in turn connected to the DataCore servers.
Replication between the storage boxes is enabled. Every component is on the VMware HCL, and all the technical bulletins have been followed.
The ESX hosts (and the storage) are configured to use ALUA with the Fixed_AP PSP.
The issue:
When we stop one switch, failover occurs and one path goes dead. When the switch comes back on, the secondary path comes back online.
When a DataCore server is restarted, one of the two paths goes dead and never comes back online, even after the DataCore server is up again.
The only way to get the second path back is to reboot the ESX host...
I have installed many infrastructures but have never seen this case. I can't find any solution or any clue about how to solve it.
Any help or information would be appreciated, of course.
Couple of questions:
1. Once the vVols have recovered (Healthy), are the paths still dead?
2. When you check the "channel" on the SANsymphony server, are the ESX hosts "logged in"?
3. Are all the channels healthy for the specific vVols (once the mirror relationship recovers)?
4. Is the vVol configured as a 3PAR? (Is it multipathed correctly?)
ESX Host 1 ---> SANsymphony1 - Port 1 (example)
           ---> SANsymphony2 - Port 1 (example)
Have you tried:
1. Re-initializing the channel once the mirrored volume has recovered?
2. I take it you have reviewed TB5 on the DataCore site? http://datacore.custhelp.com/app/answers/detail/a_id/578/~/technical-bulletins-for-use-with-datacore...
3. Once the mirror relationships have recovered, can you rescan for LUNs/VMFS volumes on the host to bring the path up? (The reboot obviously works.)
4. Also, can you confirm which advanced settings you have set for your application channels (e.g. link mode, speed, and other settings)?
Also, there is a fix listed for SANsymphony 7.0 PSP4 Update 1:
"Problem: After target port re-initialization, some initiator ports, especially on VMware ESX application servers, failed to re-login."
Have you applied Update 1 or 2 for SANsymphony 7.0 PSP4 yet?
Can you provide a diagram of the configuration?
You're right, my post was not complete.
We installed PSP4 Update 2, which should include the fix for this bug (it was fixed in Update 1).
The ESX servers are logged out after the DataCore reboot.
All the vVols are OK. The multipathing seems correct (it worked fine when we tried disabling ports on the fabrics).
We tried re-initializing the port, but no luck. An ESX rescan was no more successful.
For the last question, if I understand your request correctly, we tried the different modes: loop, PtoP, and loop with PtoP if that fails
(loop mode is not supported by the McData switches).
The speed is 2Gb. The other settings are at their defaults, except Disable Port While Stopped, as described in the DataCore TB5c document.
Just to re-clarify: once the SANsymphony box is up and running after the reboot, are the ESX hosts shown as "logged in" (on the SANsymphony server) on the supposedly dead path? (Maybe I misunderstood your comment on this.) If not, then your comments confirm that a re-initialization still does not make them log in... correct?
What if you disconnect the app channel FC cable and reconnect it to the FC switch? And the same for the ESX host (then rescan)?
If they are listed as logged out, can you check whether the link status of the app channel (to the FC switch) is good?
Typically the app channel should be set to "Target only"; however, with some configurations it helps to set the port to Target/Initiator to ensure the HBA on the DataCore server logs in to the FC switch correctly (an uncommon issue, but it can happen).
Have you tried (while in this failed-path state) mapping a new non-mirrored volume through the dead path to the same ESX host to see whether it is seen?
What I am trying to test is whether all volumes on the rebooted SANsymphony box are affected. (Do you have any other app channels you can try?)
Back to the standard stuff: VMware patching, FC switch firmware, etc.
Can you confirm you have supported HBAs and firmware for SANsymphony? (I know you said you checked all the HCLs and so on... but anyway.)
It looks like you have tried just about everything; it may be time to log a support call with DataCore, or maybe VMware!
Thanks for the support, Bernie :)
Yes, we tried removing the fibre to "force" a new connection, with no luck.
When adding a new volume, only one path is seen.
You're right: on the ESX side, the path shows as dead until we rescan the HBA, which makes the dead path disappear.
We have 3 cards (dual channel) for apps (one pair dedicated to ESX 4.1); no other hosts are affected (ESX 3.5, MSCS, Unix machines).
The only workaround we found was to set the card on the DataCore server to Target/Initiator to force a login, and then set it back to Target.
Two calls have been opened with VMware and DataCore, but no "serious" answers yet.
Yes, regarding Target/Initiator mode: as I mentioned previously, I have seen this before. From memory, it was when using direct-connect FC or some FC switches with old firmware...
It is certainly not a fix or an ideal configuration setting for the app channel. Let us know what you hear back from VMware / DataCore.
I am just finishing a SANsymphony-V 8.0 split-site installation with a single ESX 4.1 U1 cluster (4 hosts) across sites. Everything is working OK, including failing and recovering DataCore servers. My pathing is set to RR with ALUA. Our customer is lucky enough to have multiple dedicated fibre runs between the sites (yay).
All the best.
Thanks once again :)
I'll keep you informed about this case.
Just one remark about RR/ALUA: failback is not available with that configuration. Even though RR looks better for performance, we chose Fixed_AP mode to let the DataCore server set the preferred path (this configuration allows failback).
What kind of SAN switches do you use? Did you apply new firmware?
Be sure to check "Known Issues: Third Party Software/Hardware with DataCore Software", a PDF from DataCore.
Some switches need special parameters; for example, page 4 describes a logged-out problem with Brocade switches.
Maybe you should check this.