Hi there,
I have 2 ESX servers and 2 SANs (SANmelody) in an alternate-path setup (i.e. there are two switches between the ESX servers and the SANs). At the moment, if the link between the primary SAN and the switch goes down, esx2 fails over by alternate pathing to san2 (san2 then becomes primary). However, esx1 is not failing over properly. On the SAN side I get an error stating "An invalid task management command 6 was received from port tag 4. Command rejected.".
I am trying to work out whether it is a SAN issue or a VMware issue, but I am not sure where to look. I have double-checked all the settings and they seem to be OK. It looks as if san2 rejects esx1's request to use the path.
Does anyone have any ideas where I should start looking?
many thanks,
Paul
I also found this in the vmkwarning log file...
Check Unit Ready Command returned an error instead of
NOT READY for standby controller
if anyone has any suggestions please chime in, I am
at a loss
This error message is in one of those PPT slides...
Regards
Mike
How are you testing the failover? It could be a path-thrashing situation caused by incorrect cabling...
There was a TSX presentation on some of the popular SAN-based problems...
It was in one of the two "Top Support Issues"
http://www.vmware-tsx.com/download.php?asset_id=49
http://www.vmware-tsx.com/download.php?asset_id=50
Regards
Mike
nice one Mike,
I will have a read through and see if I can solve the problem
cheers,
Paul
looks like I have a trespass problem
May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)WARNING: SCSI: 1785: Manual switchover to path vmhba40:1:0 begins.
May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)SCSI: 1789: Changing active path to vmhba40:1:0
May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)iSCSI: session 0x814fb88 eh_device_reset at 7512628 for command 0x800bfe8 to (0 0 1 0), cdb 0x0
May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu3:1058)iSCSI: session 0x814fb88 requested target reset for (0 0 1 *), warm reset itt 251829 at 7512628
May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1059)iSCSI: session 0x814fb88 warm target reset rejected (0x5) for mgmt 251829 at 7512628
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1026)WARNING: SCSI: 5422: READ of handleID 0x98a
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 8021: vmhba40:1:0:1 status = 0/3 0x0 0x0 0x0
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 8109: vmhba40:1:0:1 Retry (abort after timeout)
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 3169: vmhba40:1:0:1 Abort cmd due to timeout, s/n=109545, attempt 1
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x6201930, originSN 109545 from vmhba40:0:0
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 3182: vmhba40:1:0:1 Abort cmd on timeout succeeded, s/n=109545, attempt 1
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1026)WARNING: SCSI: 5422: READ of handleID 0x291c
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 8021: vmhba40:0:2:0 status = 0/3 0x0 0x0 0x0
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 8109: vmhba40:0:2:0 Retry (abort after timeout)
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 3169: vmhba40:0:2:0 Abort cmd due to timeout, s/n=2, attempt 1
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x6201960, originSN 2 from vmhba40:0:2
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x6201960, originSN 2 from vmhba40:0:2
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 3182: vmhba40:0:2:0 Abort cmd on timeout succeeded, s/n=2, attempt 1
I never receive the "Check Unit Ready Command returned READY instead of NOT READY for standby controller" message.
I will have to investigate some more.
thanks,
Paul
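For anyone reading a longer vmkernel log than the excerpt above, here is a minimal Python sketch that picks out the three event types visible in it: the path switchover, the rejected warm target reset, and the aborts on timeout. The labels in `PATTERNS` and the `classify` helper are names made up for illustration, and the regular expressions are based only on the lines pasted above, so other ESX builds may word these messages differently.

```python
import re

# Sample lines copied from the log excerpt above.
SAMPLE = """\
May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)WARNING: SCSI: 1785: Manual switchover to path vmhba40:1:0 begins.
May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1059)iSCSI: session 0x814fb88 warm target reset rejected (0x5) for mgmt 251829 at 7512628
May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 3169: vmhba40:1:0:1 Abort cmd due to timeout, s/n=109545, attempt 1
"""

# Hypothetical labels; each pattern captures one detail (path, status code, or device).
PATTERNS = {
    "switchover": re.compile(r"Manual switchover to path (\S+) begins"),
    "reset_rejected": re.compile(r"target reset rejected \((0x[0-9a-fA-F]+)\)"),
    "abort_timeout": re.compile(r"(vmhba\S+) Abort cmd due to timeout"),
}


def classify(lines):
    """Return (label, detail) pairs for lines matching a known pattern, in log order."""
    events = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((label, match.group(1)))
                break  # at most one label per line
    return events


if __name__ == "__main__":
    for label, detail in classify(SAMPLE.splitlines()):
        print(label, detail)
```

Seeing a `reset_rejected` event right after a `switchover`, followed by `abort_timeout` events, is the sequence shown in the excerpt.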
SanMelody also maintains logs in the Trace Console.
If you have one ESX server that fails over correctly and another that does not, it may be worthwhile looking for differences in the command history for both ESX servers from the SAN's point of view.
(Apologies if I'm teaching my grandmother to suck eggs)
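One rough way to do that comparison is to reduce each log line to a "signature" with the numbers masked out, then list the signatures that appear for one host but not the other. This is only a sketch: the Trace Console format is not shown in this thread, so the code assumes vmkernel-style lines like the ones pasted earlier, and `signature` and `only_in_first` are hypothetical helper names.

```python
import re


def signature(line):
    """Reduce a log line to a rough signature: drop the prefix, mask the numbers."""
    # Assumption: everything up to the first ')' is timestamp/host/cpu prefix,
    # as in the vmkernel lines quoted in this thread.
    _, _, message = line.partition(")")
    message = message or line  # no ')' found: keep the whole line
    # Mask hex and decimal numbers so lines differing only in IDs compare equal.
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "N", message).strip()


def only_in_first(lines_a, lines_b):
    """Signatures seen in lines_a but never in lines_b, in first-seen order."""
    seen_b = {signature(line) for line in lines_b}
    result = []
    for line in lines_a:
        sig = signature(line)
        if sig not in seen_b and sig not in result:
            result.append(sig)
    return result
```

Feeding it the failing host's log as `lines_a` and the working host's log as `lines_b` would list message types that only the failing host produces, such as the rejected target reset.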
I have solved the problem, but it is strange. The SANmelody FAQ says you have to add an entry of "SANmelody :" under Configuration, Advanced Settings, Disk, in disks.deviceswithapfailover.
When I took the entry out it all worked fine. Now that is weird. I am not getting any errors at all now and it works great. Woohoo! I have just created my infrastructure. If one SAN/path fails, my VMs just use my other SAN on the other side of the company network.
I am happy, though a bit disconcerted as to why DataCore's FAQ stopped it working. I am on SANmelody 2.01 update 6.01???
That is good news, though it is strange that it contradicts DataCore's FAQ.
We've just had a quote of 4.5K for a SANmelody DR1 pack (2 servers plus mirror and failover), so we will probably be going down the same path ourselves.
A quick question if you don't mind. When the ESX fails over does it cause errors on the running VMs or do they continue running as normal?
When ESX fails over (in terms of AP) it works fine: no errors found so far and no ping interruption. In the next couple of weeks I will be testing it with a SQL Server VM, failing over while a query is running. Will let you know how it goes.
cheers,
paul