I have been having an intermittent problem losing my iSCSI connection on my ESX 3.1 server about once a week. So I am trying to add a redundant or failover connection to the iSCSI box.
So far I have set up this:
iSCSI Appliance
192.168.2.1/255.255.255.0
192.168.3.1/255.255.255.0
ESX Server
vSwitch3
192.168.2.10 (Service Console)
192.168.2.20 (VMKernel)
vSwitch5
192.168.3.10 (Service Console)
192.168.3.20 (VMKernel)
Under Configuration, I click on Properties for the iSCSI software adapter.
Under Dynamic Discovery I have both IP addresses listed, one for each of the iSCSI appliance's NICs.
Is this enough to get redundancy/failover, or am I missing something?
(I'm a noob and learning as I go on this...)
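For anyone following along, the layout above can be recreated from the ESX service console with something like the following. This is a sketch using the standard ESX 3.x esxcfg commands and the vSwitch, port group, and vswif names from my setup; double-check the flags against your build before running anything.

```shell
# First path: vSwitch3 on vmnic1, 192.168.2.0/24
esxcfg-vswitch -a vSwitch3
esxcfg-vswitch -L vmnic1 vSwitch3
esxcfg-vswitch -A "VMkernel" vSwitch3
esxcfg-vswitch -A "Service Console 2" vSwitch3
esxcfg-vmknic -a -i 192.168.2.20 -n 255.255.255.0 "VMkernel"
esxcfg-vswif -a vswif1 -p "Service Console 2" -i 192.168.2.10 -n 255.255.255.0

# Second path: vSwitch5 on vmnic2, 192.168.3.0/24
esxcfg-vswitch -a vSwitch5
esxcfg-vswitch -L vmnic2 vSwitch5
esxcfg-vswitch -A "VMkernel 2" vSwitch5
esxcfg-vswitch -A "Service Console 3" vSwitch5
esxcfg-vmknic -a -i 192.168.3.20 -n 255.255.255.0 "VMkernel 2"
esxcfg-vswif -a vswif2 -p "Service Console 3" -i 192.168.3.10 -n 255.255.255.0
```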
What iSCSI storage are you using?
I suppose you can't do failover by configuring two target IP addresses.
What iSCSI storage are you using?
It's a miSAN iSCSI unit from Cybernetics.
I suppose you can't do failover by configuring two target IP addresses.
Well, the first thing I tried was to set the two NICs in the iSCSI unit to 192.168.2.1 and 192.168.2.2, but the iSCSI unit complains that the addresses can't be on the same subnet.
So I set up two different VLANs and put each NIC in its own VLAN, set up the VLANs on two switches, and set up two vSwitches in ESX. Since iSCSI on ESX won't list two paths to the same iSCSI target, I wasn't sure it was finding the target.
This morning I pulled the network cable on the 192.168.2.1 NIC and it didn't miss a beat switching over to the 192.168.3.1 NIC. So I guess it does work.
I'm still trying to find out whether ESX keeps a log of when one iSCSI link dies and it starts using the other.
To really test the failover you should have a running VM on an iSCSI volume (e.g. with a running clock) and then pull the cable.
You can check /var/log/vmkernel and /var/log/messages for warnings/errors.
Can you post the following here:
esxcfg-mpath -l
esxcfg-vmknic -l
esxcfg-vswif -l
esxscg-vswitch -l
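Once you have those logs, something like the grep below pulls out the relevant lines. The sample file here is made up of lines in the same shape as real VMkernel output, just so the pattern can be tried anywhere; on a live host you would point the grep at /var/log/vmkernel directly.

```shell
# Made-up sample lines in the shape of ESX VMkernel log output
cat > /tmp/vmkernel.sample <<'EOF'
vmkernel: 3:05:10:49.809 cpu1:1131)<3>bnx2: vmnic2 NIC Link is Down
vmkernel: 3:05:12:39.702 cpu3:1060)iSCSI: session 0x3d5c3bf8 connect timed out at 27795914
vmkernel: 3:05:12:40.705 cpu3:1060)iSCSI: bus 0 target 0 established session 0x3d5c3bf8 #7
EOF

# Filter for iSCSI session events and NIC link state changes
grep -Ei 'iscsi|link is (up|down)' /tmp/vmkernel.sample
```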
To really test the failover you should have a running VM on an iSCSI volume (e.g. with a running clock) and then pull the cable.
I did keep one NetWare server running at the DRDOS level - it could still run commands and load NetWare after I pulled the cable.
You can check /var/log/vmkernel and /var/log/messages
for warnings/errors.
Thanks
Can you post the following here:
esxcfg-mpath -l
esxcfg-vmknic -l
esxcfg-vswif -l
esxscg-vswitch -l
esxcfg-mpath -l
Disk vmhba0:0:0 /dev/sda (152587MB) has 1 paths and policy of Fixed
Local 2:8.0 vmhba0:0:0 On active preferred
Disk vmhba40:0:0 /dev/sdc (1430448MB) has 1 paths and policy of Fixed
iScsi sw iqn.1998-01.com.vmware:cwg157-160f7576<->iqn.2007-06.com.cybernetics:17896443bd2666ddda377ea4b96fd6cf.vdisk2 vmhba40:0:0 On active preferred
esxcfg-vmknic -l
Port Group IP Address Netmask Broadcast MAC Address MTU Enabled
VMkernel 2 192.168.3.20 255.255.255.0 192.168.3.255 00:50:56:64:9f:67 1514 true
VMkernel 192.168.2.20 255.255.255.0 192.168.2.255 00:50:56:6f:88:d8 1514 true
esxcfg-vswif -l
Name Port Group IP Address Netmask Broadcast Enabled DHCP
vswif0 Service Console 10.10.75.25 255.255.255.0 10.10.75.255 true false
vswif1 Service Console 2 192.168.2.10 255.255.255.0 192.168.2.255 true false
vswif2 Service Console 3 192.168.3.10 255.255.255.0 192.168.3.255 true false
esxscg-vswitch -l
-bash: esxscg-vswitch: command not found
Sorry, the last command should be:
esxcfg-vswitch -l
Did you pull the cable from the host or from the iSCSI storage?
I pulled the cable from the iSCSI appliance. The vmkernel log showed it took 3 seconds to switch over to the other nic and resume using the iSCSI LUN on the other address.
esxcfg-vswitch -l
Switch Name Num Ports Used Ports Configured Ports Uplinks
vSwitch0 32 3 32 vmnic0
PortGroup Name Internal ID VLAN ID Used Ports Uplinks
Service Console portgroup0 0 1 vmnic0
Switch Name Num Ports Used Ports Configured Ports Uplinks
vSwitch1 64 4 64 vmnic5,vmnic4
PortGroup Name Internal ID VLAN ID Used Ports Uplinks
RJ1 VM Network portgroup4 0 1 vmnic4,vmnic5
Switch Name Num Ports Used Ports Configured Ports Uplinks
vSwitch2 64 8 64 vmnic7,vmnic6
PortGroup Name Internal ID VLAN ID Used Ports Uplinks
CH1 VM Network portgroup5 0 5 vmnic6,vmnic7
Switch Name Num Ports Used Ports Configured Ports Uplinks
vSwitch3 64 4 64 vmnic1
PortGroup Name Internal ID VLAN ID Used Ports Uplinks
VMkernel portgroup6 0 1 vmnic1
Service Console 2 portgroup7 0 1 vmnic1
Switch Name Num Ports Used Ports Configured Ports Uplinks
vSwitch4 64 4 64 vmnic3
PortGroup Name Internal ID VLAN ID Used Ports Uplinks
RC1 VM Network portgroup9 0 2 vmnic3
Switch Name Num Ports Used Ports Configured Ports Uplinks
vSwitch5 64 4 64 vmnic2
PortGroup Name Internal ID VLAN ID Used Ports Uplinks
Service Console 3 portgroup11 0 1 vmnic2
VMkernel 2 portgroup10 0 1 vmnic2
What happens when you pull the cable from the ESX host?
I would test it under load, though.
This configuration is new to me - I must admit I didn't think it would work, but it seems to.
I pulled the cable from the ESX host with all 8 guests running. None of the servers crashed, but they did record 'disk timeout' errors in their log files.
I get the following messages in the vmkernel log file:
09:16:41 vmkernel: 3:05:10:49.809 cpu1:1131)<3>bnx2: vmnic2 NIC Link is Down
09:18:31 vmkernel: 3:05:12:39.702 cpu3:1060)iSCSI: session 0x3d5c3bf8 connect timed out at 27795914
09:18:31 vmkernel: 3:05:12:39.702 cpu3:1060)iSCSI: session 0x3d5c3bf8 to VMWare Disk 1 waiting 1 seconds before next login attempt
09:18:32 vmkernel: 3:05:12:40.703 cpu3:1060)iSCSI: bus 0 target 0 trying to establish session 0x3d5c3bf8 to portal 0, address 192.168.2.1 port 3260 group 1
09:18:32 vmkernel: 3:05:12:40.704 cpu3:1060)<7>iSCSI: session 0x3d5c3bf8 authenticated by target iqn.2007-06.com.cybernetics:17896443bd2666ddda377ea4b96fd6cf.vdisk2
09:18:32 vmkernel: 3:05:12:40.705 cpu3:1060)iSCSI: bus 0 target 0 established session 0x3d5c3bf8 #7 to portal 0, address 192.168.2.1 port 3260 group 1, alias VMWare Disk 1
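As an aside, the two timestamps in that log are worth a second look: the NIC went down at 09:16:41 and the session wasn't re-established until 09:18:32, which would explain the disk timeout errors in the guests. A quick sanity check on the arithmetic:

```shell
# Gap between "NIC Link is Down" (09:16:41) and "established session"
# (09:18:32) in the vmkernel log quoted above
awk 'BEGIN {
  down = 9*3600 + 16*60 + 41   # 09:16:41
  up   = 9*3600 + 18*60 + 32   # 09:18:32
  print (up - down) " seconds without a path"
}'
# prints: 111 seconds without a path
```

That is well past the default disk timeout inside most guest operating systems, so logged timeout errors are expected even though nothing crashed.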
I run multipathing, but I have two HBAs (QLA4050c) per ESX host and two switches (ProCurve); each switch is connected to both SPs (CX300i).
Tested and passed with flying colours, but it's not cheap.
Have a look at what esxcfg-mpath -l gives you.
Have a look at what esxcfg-mpath -l gives you.
It just shows the one path that exists.
I know this isn't the perfect setup. The final plan is to have 2 ESX servers, 2 switches, and 2 iSCSI targets. Management won't release funds for the rest of the equipment until the current setup is stable, reliable, and faster. That is why I was hoping that 2 IP addresses for the 1 LUN would provide failover and stop the random loss of the iSCSI connection. So far it seems to be working when I pull the cable - I'm just not confident it will help with the random disconnects.