VMware Cloud Community
smoke455
Contributor

iSCSI redundancy/failover help

I have been having an intermittent problem losing my iSCSI connection on my ESX 3.1 server about once a week. So I am trying to add a redundant or failover connection to the iSCSI box.

So far I have set up this:

iSCSI Appliance

192.168.2.1/255.255.255.0

192.168.3.1/255.255.255.0

ESX Server

vSwitch3

192.168.2.10 (Service Console)

192.168.2.20 (VMKernel)

vSwitch5

192.168.3.10 (Service Console)

192.168.3.20 (VMKernel)

Under Configuration, I click Properties for the iSCSI software adapter.

Under Dynamic Discovery, I have both IP addresses listed, one for each of the iSCSI appliance NICs.

Is this enough to get redundancy/failover, or am I missing something?

(I'm a noob and learning as I go on this...)
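As a sanity check on the addressing above, here is a minimal sketch (in Python, which is an assumption since the thread itself only uses service console commands) that confirms each VMkernel port sits in the same subnet as exactly one appliance NIC. The labels are illustrative; only the addresses come from the post.

```python
import ipaddress

# Addresses copied from the post; the labels are only illustrative.
appliance_nics = {
    "appliance NIC 1": ipaddress.ip_interface("192.168.2.1/255.255.255.0"),
    "appliance NIC 2": ipaddress.ip_interface("192.168.3.1/255.255.255.0"),
}
vmkernel_ports = {
    "vSwitch3 VMkernel": ipaddress.ip_interface("192.168.2.20/255.255.255.0"),
    "vSwitch5 VMkernel": ipaddress.ip_interface("192.168.3.20/255.255.255.0"),
}


def same_subnet_pairs(ports, targets):
    """Return (port, target) label pairs where the target IP lies in the port's subnet."""
    return [
        (port_name, target_name)
        for port_name, port in ports.items()
        for target_name, target in targets.items()
        if target.ip in port.network
    ]


pairs = same_subnet_pairs(vmkernel_ports, appliance_nics)
for port_name, target_name in pairs:
    print(f"{port_name} shares a subnet with {target_name}")
```

This only checks the IP plan. It says nothing about whether the software initiator will actually fail over between the two addresses.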

10 Replies
christianZ
Champion

What iSCSI storage are you using?

I suppose you can't do failover by configuring 2 target IP addresses.

smoke455
Contributor

What iSCSI storage are you using?

It's a miSAN iSCSI unit from Cybernetics.

I suppose you can't do failover by configuring 2 target IP addresses.

Well, the first thing I tried was to set the 2 NICs in the iSCSI unit to 192.168.2.1 and 192.168.2.2, but the iSCSI unit complains that the addresses can't be on the same subnet.

So I set up 2 different VLANs and put each NIC in its own VLAN. I set up the VLANs on 2 switches and set up 2 vSwitches in ESX. Since iSCSI on ESX won't list 2 paths to the same iSCSI target, I wasn't sure that it was finding the target.

This morning I pulled the network cable on the 192.168.2.1 NIC and it didn't miss a beat switching over to the 192.168.3.1 NIC. So I guess it does work.

I'm still trying to find out whether ESX keeps a log of when one iSCSI link dies and it starts using the other.

christianZ
Champion

To really test the failover, you should have a VM running on the iSCSI volume (e.g. with a running clock) and then pull the cable.

You can check /var/log/vmkernel and /var/log/messages for warnings/errors.

Can you post the output of the following here:

esxcfg-mpath -l

esxcfg-vmknic -l

esxcfg-vswif -l

esxscg-vswitch -l

smoke455
Contributor

To really test the failover, you should have a VM running on the iSCSI volume (e.g. with a running clock) and then pull the cable.

I did keep one NetWare server running at the DRDOS level - it could still run commands and load NetWare after I pulled the cable.

You can check /var/log/vmkernel and /var/log/messages for warnings/errors.

Thanks

Can you post the output of the following here:

esxcfg-mpath -l

esxcfg-vmknic -l

esxcfg-vswif -l

esxscg-vswitch -l

esxcfg-mpath -l

Disk vmhba0:0:0 /dev/sda (152587MB) has 1 paths and policy of Fixed

Local 2:8.0 vmhba0:0:0 On active preferred

Disk vmhba40:0:0 /dev/sdc (1430448MB) has 1 paths and policy of Fixed

iScsi sw iqn.1998-01.com.vmware:cwg157-160f7576<->iqn.2007-06.com.cybernetics:17896443bd2666ddda377ea4b96fd6cf.vdisk2 vmhba40:0:0 On active preferred

esxcfg-vmknic -l

Port Group   IP Address     Netmask         Broadcast       MAC Address         MTU    Enabled
VMkernel 2   192.168.3.20   255.255.255.0   192.168.3.255   00:50:56:64:9f:67   1514   true
VMkernel     192.168.2.20   255.255.255.0   192.168.2.255   00:50:56:6f:88:d8   1514   true

esxcfg-vswif -l

Name     Port Group          IP Address     Netmask         Broadcast       Enabled   DHCP
vswif0   Service Console     10.10.75.25    255.255.255.0   10.10.75.255    true      false
vswif1   Service Console 2   192.168.2.10   255.255.255.0   192.168.2.255   true      false
vswif2   Service Console 3   192.168.3.10   255.255.255.0   192.168.3.255   true      false

esxscg-vswitch -l

-bash: esxscg-vswitch: command not found
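The esxcfg-mpath -l output in this post reports "has 1 paths" for the iSCSI disk, which is the detail worth watching when judging redundancy. A rough sketch of pulling the path count per disk out of saved output follows. Python is an assumption here, and the regular expression is written only against the line format shown in this post; other ESX versions may print something different.

```python
import re

# Output pasted from the post above (two disks, one path each).
MPATH_OUTPUT = """\
Disk vmhba0:0:0 /dev/sda (152587MB) has 1 paths and policy of Fixed
 Local 2:8.0 vmhba0:0:0 On active preferred
Disk vmhba40:0:0 /dev/sdc (1430448MB) has 1 paths and policy of Fixed
 iScsi sw iqn.1998-01.com.vmware:cwg157-160f7576<->iqn.2007-06.com.cybernetics:17896443bd2666ddda377ea4b96fd6cf.vdisk2 vmhba40:0:0 On active preferred
"""

# Matches only the "Disk ..." summary lines in the format shown above.
DISK_LINE = re.compile(
    r"^Disk (\S+) (\S+) \((\d+)MB\) has (\d+) paths and policy of (\w+)"
)


def path_counts(text):
    """Map each disk name (e.g. vmhba40:0:0) to its reported number of paths."""
    counts = {}
    for line in text.splitlines():
        match = DISK_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(4))
    return counts


counts = path_counts(MPATH_OUTPUT)
single_path_disks = [disk for disk, n in counts.items() if n < 2]
print("Disks with fewer than 2 paths:", single_path_disks)
```

Both disks show up as single-path here, so whatever failover happens between the two appliance addresses is not visible as a second path in this output.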

christianZ
Champion

Sorry, the last command should be:

esxcfg-vswitch -l

Did you pull the cable from the host or from the iSCSI storage?

smoke455
Contributor

I pulled the cable from the iSCSI appliance. The vmkernel log showed it took 3 seconds to switch over to the other nic and resume using the iSCSI LUN on the other address.

esxcfg-vswitch -l

Switch Name   Num Ports   Used Ports   Configured Ports   Uplinks
vSwitch0      32          3            32                 vmnic0

  PortGroup Name      Internal ID   VLAN ID   Used Ports   Uplinks
  Service Console     portgroup0    0         1            vmnic0

Switch Name   Num Ports   Used Ports   Configured Ports   Uplinks
vSwitch1      64          4            64                 vmnic5,vmnic4

  PortGroup Name      Internal ID   VLAN ID   Used Ports   Uplinks
  RJ1 VM Network      portgroup4    0         1            vmnic4,vmnic5

Switch Name   Num Ports   Used Ports   Configured Ports   Uplinks
vSwitch2      64          8            64                 vmnic7,vmnic6

  PortGroup Name      Internal ID   VLAN ID   Used Ports   Uplinks
  CH1 VM Network      portgroup5    0         5            vmnic6,vmnic7

Switch Name   Num Ports   Used Ports   Configured Ports   Uplinks
vSwitch3      64          4            64                 vmnic1

  PortGroup Name      Internal ID   VLAN ID   Used Ports   Uplinks
  VMkernel            portgroup6    0         1            vmnic1
  Service Console 2   portgroup7    0         1            vmnic1

Switch Name   Num Ports   Used Ports   Configured Ports   Uplinks
vSwitch4      64          4            64                 vmnic3

  PortGroup Name      Internal ID   VLAN ID   Used Ports   Uplinks
  RC1 VM Network      portgroup9    0         2            vmnic3

Switch Name   Num Ports   Used Ports   Configured Ports   Uplinks
vSwitch5      64          4            64                 vmnic2

  PortGroup Name      Internal ID   VLAN ID   Used Ports   Uplinks
  Service Console 3   portgroup11   0         1            vmnic2
  VMkernel 2          portgroup10   0         1            vmnic2

christianZ
Champion

What happens when you pull the cable from the ESX host?

I would test it under load, though.

This configuration is new to me. I must admit I didn't think it would work, but it seems to.

smoke455
Contributor

I pulled the cable from the ESX host with all 8 guests running. None of the servers crashed, but they did record 'disk timeout error' in their log files.

I get the following messages in the vmkernel log file:

09:16:41 vmkernel: 3:05:10:49.809 cpu1:1131)<3>bnx2: vmnic2 NIC Link is Down
09:18:31 vmkernel: 3:05:12:39.702 cpu3:1060)iSCSI: session 0x3d5c3bf8 connect timed out at 27795914
09:18:31 vmkernel: 3:05:12:39.702 cpu3:1060)iSCSI: session 0x3d5c3bf8 to VMWare Disk 1 waiting 1 seconds before next login attempt
09:18:32 vmkernel: 3:05:12:40.703 cpu3:1060)iSCSI: bus 0 target 0 trying to establish session 0x3d5c3bf8 to portal 0, address 192.168.2.1 port 3260 group 1
09:18:32 vmkernel: 3:05:12:40.704 cpu3:1060)<7>iSCSI: session 0x3d5c3bf8 authenticated by target iqn.2007-06.com.cybernetics:17896443bd2666ddda377ea4b96fd6cf.vdisk2

09:18:32 vmkernel: 3:05:12:40.705 cpu3:1060)iSCSI: bus 0 target 0 established session 0x3d5c3bf8 #7 to portal 0, address 192.168.2.1 port 3260 group 1, alias VMWare Disk 1
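The log excerpt above can be read as a timeline: vmnic2 (the vSwitch5 uplink on the 192.168.3.x side) goes down, and the session is later re-established to 192.168.2.1. A rough sketch that subtracts the wall-clock timestamps of those two events follows. Python is an assumption here, and the sketch assumes both events fall on the same day and that the first field of each line is an HH:MM:SS timestamp, as in this excerpt.

```python
from datetime import datetime

# Lines copied from the vmkernel log excerpt in this post.
LOG_LINES = [
    "09:16:41 vmkernel: 3:05:10:49.809 cpu1:1131)<3>bnx2: vmnic2 NIC Link is Down",
    "09:18:31 vmkernel: 3:05:12:39.702 cpu3:1060)iSCSI: session 0x3d5c3bf8 connect timed out at 27795914",
    "09:18:31 vmkernel: 3:05:12:39.702 cpu3:1060)iSCSI: session 0x3d5c3bf8 to VMWare Disk 1 waiting 1 seconds before next login attempt",
    "09:18:32 vmkernel: 3:05:12:40.703 cpu3:1060)iSCSI: bus 0 target 0 trying to establish session 0x3d5c3bf8 to portal 0, address 192.168.2.1 port 3260 group 1",
    "09:18:32 vmkernel: 3:05:12:40.704 cpu3:1060)<7>iSCSI: session 0x3d5c3bf8 authenticated by target iqn.2007-06.com.cybernetics:17896443bd2666ddda377ea4b96fd6cf.vdisk2",
    "09:18:32 vmkernel: 3:05:12:40.705 cpu3:1060)iSCSI: bus 0 target 0 established session 0x3d5c3bf8 #7 to portal 0, address 192.168.2.1 port 3260 group 1, alias VMWare Disk 1",
]


def wall_clock(line):
    """Parse the leading HH:MM:SS field of a log line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S")


def outage_seconds(lines):
    """Seconds between the first link-down line and the first re-established session."""
    down = next(line for line in lines if "NIC Link is Down" in line)
    up = next(line for line in lines if "established session" in line)
    return (wall_clock(up) - wall_clock(down)).total_seconds()


gap = outage_seconds(LOG_LINES)
print(f"Link down to session re-established: {gap:.0f} seconds")
```

On these lines the gap is 111 seconds, compared with the 3 seconds reported earlier when the cable was pulled at the appliance. This is only the gap between two log entries, not a direct measure of how long guest I/O stalled, but it would be consistent with the guests logging disk timeout errors.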

zbenga
Enthusiast

I run multipathing, but I have 2 HBAs (QLA4050c) per ESX host and two switches (ProCurve), and each switch is connected to both SPs (CX300i).

It was tested and passed with flying colours, but it's not cheap.

Have a look at what esxcfg-mpath -l gives you.

smoke455
Contributor

Have a look at what esxcfg-mpath -l gives you.

It just shows the one path that exists.

I know this isn't the perfect setup. The final plan is to have 2 ESX servers, 2 switches, and 2 iSCSI targets. Management won't release funds for the rest of the equipment until the current setup is stable, reliable, and faster. That is why I was hoping that 2 IP addresses for the 1 LUN would provide failover and stop the random loss of the iSCSI connection. So far it seems to be working when I pull the cable - I'm just not confident it will help with the random disconnects.
