VMware Cloud Community
pauliew1978
Enthusiast

alternate pathing works on one esx server but not the other

Hi there,

I have two ESX servers and two SANs (SANmelody) in an alternate-path configuration (i.e. there are two switches between the ESX servers and the SANs). At the moment, if the link between the primary SAN and the switch goes down, esx2 fails over via alternate pathing to san2 (san2 then becomes primary). However, esx1 is not failing over properly. On the SAN side I get an error stating "An invalid task management command 6 was received from port tag 4. Command rejected."

I am trying to work out whether it is a SAN issue or a VMware issue, but I am not sure where to look. I have double-checked all the settings and they seem to be OK. It looks as if san2 rejects esx1's request to use the path.

Does anyone have any ideas where I should start looking?

many thanks,

Paul


Accepted Solutions
Michelle_Laveri
Virtuoso

I also found this in the vmkwarning log file...

Check Unit Ready Command returned an error instead of NOT READY for standby controller

if anyone has any suggestions please chime in, I am at a loss :(

This error message is in one of those PPT slides...

Regards

Mike

Regards
Michelle Laverick
@m_laverick
http://www.michellelaverick.com

9 Replies
Michelle_Laveri
Virtuoso

How are you testing the failover? It could be a path-thrashing situation caused by incorrect cabling...

There was a TSX presentation on some of the popular SAN-based problems...

It was in one of the two "Top Support Issues" decks:

http://www.vmware-tsx.com/download.php?asset_id=49

http://www.vmware-tsx.com/download.php?asset_id=50

Regards

Mike

Regards
Michelle Laverick
@m_laverick
http://www.michellelaverick.com

pauliew1978
Enthusiast

nice one Mike,

I will have a read through and see if I can solve the problem

cheers,

Paul

pauliew1978
Enthusiast

Looks like I have a trespass problem:

May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)WARNING: SCSI: 1785: Manual switchover to path vmhba40:1:0 begins.

May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)SCSI: 1789: Changing active path to vmhba40:1:0

May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)iSCSI: session 0x814fb88 eh_device_reset at 7512628 for command 0x800bfe8 to (0 0 1 0), cdb 0x0

May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu3:1058)iSCSI: session 0x814fb88 requested target reset for (0 0 1 *), warm reset itt 251829 at 7512628

May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1059)iSCSI: session 0x814fb88 warm target reset rejected (0x5) for mgmt 251829 at 7512628

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1026)WARNING: SCSI: 5422: READ of handleID 0x98a

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 8021: vmhba40:1:0:1 status = 0/3 0x0 0x0 0x0

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 8109: vmhba40:1:0:1 Retry (abort after timeout)

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 3169: vmhba40:1:0:1 Abort cmd due to timeout, s/n=109545, attempt 1

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x6201930, originSN 109545 from vmhba40:0:0

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 3182: vmhba40:1:0:1 Abort cmd on timeout succeeded, s/n=109545, attempt 1

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1026)WARNING: SCSI: 5422: READ of handleID 0x291c

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 8021: vmhba40:0:2:0 status = 0/3 0x0 0x0 0x0

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 8109: vmhba40:0:2:0 Retry (abort after timeout)

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 3169: vmhba40:0:2:0 Abort cmd due to timeout, s/n=2, attempt 1

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x6201960, originSN 2 from vmhba40:0:2

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x6201960, originSN 2 from vmhba40:0:2

May 22 11:49:46 esx1 vmkernel: 0:20:52:12.460 cpu2:1033)SCSI: 3182: vmhba40:0:2:0 Abort cmd on timeout succeeded, s/n=2, attempt 1

I never receive the "Check Unit Ready Command returned READY instead of NOT READY for standby controller" message.

I will have to investigate some more.

thanks,

Paul
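
For anyone reading a similar vmkernel log, here is a minimal Python sketch that pulls out the lines that matter in the excerpt above: the path switchover, the rejected target reset, and the command aborts. The regular expression is fitted only to the line layout quoted in this post, and the event labels (`switchover`, `reset_rejected`, `abort_timeout`) are names made up for illustration, not anything ESX defines.

```python
import re

# Layout assumed from the excerpt: syslog stamp, host, "vmkernel:",
# uptime, "cpuN:world)", then the message text.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+vmkernel:\s+"
    r"(?P<uptime>[\d:.]+)\s+cpu\d+:\d+\)(?P<msg>.*)$"
)

# Substrings copied from the excerpt; the labels are hypothetical.
EVENTS = {
    "switchover": "Manual switchover to path",
    "reset_rejected": "target reset rejected",
    "abort_timeout": "Abort cmd due to timeout",
}


def classify(lines):
    """Return (stamp, host, label, message) for each line matching a known event."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a vmkernel line in the assumed layout
        msg = m.group("msg").strip()
        for label, needle in EVENTS.items():
            if needle in msg:
                hits.append((m.group("stamp"), m.group("host"), label, msg))
                break
    return hits


# Three lines taken verbatim from the log above.
sample = [
    "May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1038)WARNING: SCSI: 1785: Manual switchover to path vmhba40:1:0 begins.",
    "May 22 11:49:40 esx1 vmkernel: 0:20:52:06.438 cpu2:1059)iSCSI: session 0x814fb88 warm target reset rejected (0x5) for mgmt 251829 at 7512628",
    "May 22 11:49:46 esx1 vmkernel: 0:20:52:12.458 cpu2:1034)SCSI: 3169: vmhba40:1:0:1 Abort cmd due to timeout, s/n=109545, attempt 1",
]

for stamp, host, label, _ in classify(sample):
    print(stamp, host, label)
```

Run against the full log, this makes the sequence easier to see: the switchover starts, the target rejects the warm reset, and the timeouts follow a few seconds later.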

pauliew1978
Enthusiast

I also found this in the vmkwarning log file...

Check Unit Ready Command returned an error instead of NOT READY for standby controller

If anyone has any suggestions, please chime in; I am at a loss :(

mcwill
Expert

SANmelody also maintains logs in the Trace Console.

If one ESX server fails over correctly and the other does not, it may be worthwhile looking for differences in the command history for both ESX servers from the SAN's point of view.

(Apologies if I'm teaching my grandmother to suck eggs)
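
One rough way to do that comparison is to reduce each log message to its "shape" and list the shapes that only the failing host produces. The Python sketch below assumes the messages have already been extracted as plain strings. The SANmelody trace format is not shown in this thread, so the sample uses vmkernel-style text, and the `working` list is a hypothetical stand-in for the host that fails over cleanly.

```python
import re


def normalize(message):
    """Replace hex values and decimal numbers so the same message type compares equal."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def only_in_first(first, second):
    """Message shapes seen in `first` but never in `second`, in first-seen order."""
    seen = {normalize(m) for m in second}
    result = []
    for m in first:
        shape = normalize(m)
        if shape not in seen and shape not in result:
            result.append(shape)
    return result


# Messages from the failing host (taken from the esx1 log earlier in the thread).
failing = [
    "Changing active path to vmhba40:1:0",
    "warm target reset rejected (0x5) for mgmt 251829 at 7512628",
]
# Hypothetical messages from the host that fails over correctly.
working = [
    "Changing active path to vmhba40:0:0",
]

print(only_in_first(failing, working))
```

Stripping the numbers first matters because path names, session IDs and task tags differ on every run and would otherwise make every line look unique.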

pauliew1978
Enthusiast

I have solved the problem, but it is strange. The SANmelody FAQ says you have to add an entry of "SANmelody :" in the configuration, under advanced settings/disks, in disks.deviceswithapfailover.

When I took the entry out it all worked fine. Now that is weird. I am not getting any errors at all now and it works great. Woohoo! I have just created my infrastructure. If one SAN/path fails, my VMs just use my other SAN on the other side of the company network :)

I am happy, though a bit disconcerted as to why following DataCore's FAQ stopped it working. I am on SANmelody 2.01 update 6.01???

mcwill
Expert

That is good news, though strange that it contradicts DataCore's FAQ.

We've just had a quote of 4.5K for a SANmelody DR1 pack (2 servers plus mirror and failover), so we will probably be going down the same path ourselves.

A quick question if you don't mind. When the ESX fails over does it cause errors on the running VMs or do they continue running as normal?

pauliew1978
Enthusiast

When ESX fails over (in terms of AP) it works fine: no errors discovered so far and no ping interruption. In the next couple of weeks I will be testing it with a SQL Server VM, failing over while a query is running. Will let you know how it goes.

cheers,

paul
