OK, I've got some Dell PowerEdge 1950 servers connected to an MD3200i that have been happily running ESXi 4.1 U1 for some time. I have put ESXi 5 on them using a number of scenarios, all of which have the same problem. Any thoughts?
The boot process hangs near the end of the progress bar, and the last entry on the screen is "iscsi_vmk started successfully".
If I press Alt+F12 I can see the server hasn't crashed.
I am using vCenter 5.0 with the latest Update Manager and web client.
I have used the following scenarios to install and configure:
1) host upgrade using Update Manager
2) clean install of ESXi 5, apply old ESXi 4.1 host profile, update settings, reboot
3) clean install of ESXi 5, clean configuration, reboot
Scenario 1 appeared to leave the host hanging; I gave up after 30 minutes and reinstalled.
Scenario 2 didn't seem to like having the ESXi 4.1 settings, as I ended up with my software iSCSI adapter on vmhba39. It did eventually start, after about half an hour!
Scenario 3 has been hanging for about 15 minutes so far; I'm hoping it finally boots!
Same issue here:
vSphere 5
PowerEdge R715 with dual 10 GbE adapters (1 of 2 servers is configured and attached to the SAN at the moment)
no switch in between, direct connect
Dell MD3620i
takes 30 minutes to boot
after boot-up, no iSCSI connections
I check from the shell prompt and I cannot ping between the server and MD3620i interfaces.
I have to go into the software initiator and remove the bound NICs,
then I have to go to networking and remove and re-add the VMkernel port and its respective NIC,
then I find I can ping between them,
then I go back to the software initiator and re-bind the NICs,
then the targets are established.
So I take from the workaround in KB2007108 that I should just leave one NIC bound to the software initiator for now?
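For what it's worth, the manual unbind/re-add/re-bind dance above can also be done from the ESXi 5.0 shell with esxcli. This is only a sketch of the same steps, not an official fix; the adapter name (vmhba33), VMkernel port (vmk1), port group (iSCSI-0), and IP addresses are placeholders you'd substitute for your own:

```shell
# Remove the bound VMkernel NIC from the software iSCSI initiator
esxcli iscsi networkportal remove --adapter vmhba33 --nic vmk1

# Remove and re-create the VMkernel port on its port group
esxcli network ip interface remove --interface-name vmk1
esxcli network ip interface add --interface-name vmk1 --portgroup-name iSCSI-0
esxcli network ip interface ipv4 set --interface-name vmk1 --type static \
    --ipv4 192.168.130.10 --netmask 255.255.255.0

# Check you can actually reach the array before re-binding
vmkping 192.168.130.101

# Re-bind the VMkernel port to the initiator and rescan
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk1
esxcli storage core adapter rescan --adapter vmhba33
```

You can find your software iSCSI adapter name with `esxcli iscsi adapter list` if it isn't vmhba33 on your host.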
For your situation it seems much more likely that your issue is related to networking problems at host boot, and not necessarily the iSCSI initiator or the bug. Make sure your drivers and firmware are up to date. If you can't ping the storage on boot, that will definitely cause the slow boot, so resolving the networking problem may also resolve the boot issue. The slow-boot issue shows itself when some or all of the bound ports don't have access to some or all of the targets.
I was able to get my iSCSI storage running again by putting a single iSCSI VMkernel port on one vSwitch with only one pNIC. Repeat this for each of your other NICs dedicated to iSCSI, then bind both NICs to the software iSCSI initiator. It will still take a long time to reboot, but it will work when the server comes back up.
My setup:
iSCSI-0 -> vswitch2 -> vmnic1
iSCSI-1 -> vswitch3 -> vmnic5
This is not what I normally do, but it prevented the issue that nateccs808 is describing.
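In case it saves anyone some clicking, here's roughly how that one-vSwitch-per-iSCSI-NIC layout could be built from the command line. The vSwitch, port group, and vmnic names follow the setup above; the vmk numbers, IP addresses, and adapter name (vmhba33) are placeholders for your environment:

```shell
# First path: iSCSI-0 -> vSwitch2 -> vmnic1
esxcli network vswitch standard add --vswitch-name vSwitch2
esxcli network vswitch standard uplink add --vswitch-name vSwitch2 --uplink-name vmnic1
esxcli network vswitch standard portgroup add --vswitch-name vSwitch2 --portgroup-name iSCSI-0
esxcli network ip interface add --interface-name vmk1 --portgroup-name iSCSI-0
esxcli network ip interface ipv4 set --interface-name vmk1 --type static \
    --ipv4 192.168.130.11 --netmask 255.255.255.0

# Second path: iSCSI-1 -> vSwitch3 -> vmnic5
esxcli network vswitch standard add --vswitch-name vSwitch3
esxcli network vswitch standard uplink add --vswitch-name vSwitch3 --uplink-name vmnic5
esxcli network vswitch standard portgroup add --vswitch-name vSwitch3 --portgroup-name iSCSI-1
esxcli network ip interface add --interface-name vmk2 --portgroup-name iSCSI-1
esxcli network ip interface ipv4 set --interface-name vmk2 --type static \
    --ipv4 192.168.131.11 --netmask 255.255.255.0

# Bind both VMkernel ports to the software iSCSI initiator
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk1
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk2
```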
This is the same as I stated earlier in the thread. One vSwitch per iSCSI NIC is the solution for the dead paths/failing vmkpings. You still have to bind the VMkernel ports on those vSwitches to the initiator to use them. I am using HP P2000 G3 storage, so it doesn't seem specific to just the Dell MD3220i.
That has always been the case for multipathing and is as per the iSCSI guide. But if the iSCSI initiator doesn't work without any binding (like it did previously, when you didn't need to bind), then that could mean there is another bug. You only need to bind the specific vmk ports and vSwitches to the iSCSI initiator if you want to do multipathing, rather than simply relying on NIC failover on the vSwitch.
I've just tested multiple vmk ports on a single vSwitch without any iSCSI port binding, with two uplinks active on both of the vmk port groups, and I'm not getting any dead paths at all. All paths are working as expected. I'm using QNAP, OpenFiler, and P4000 storage on Dell T710 hosts.
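If anyone wants to check for dead paths the same way from the shell rather than the GUI, something like this should show the path states and the path selection policy in use (standard esxcli on ESXi 5.0; no host-specific names assumed):

```shell
# Full per-path detail, including state (look for "dead" entries)
esxcli storage nmp path list

# Quick summary: count of paths in each state
esxcli storage core path list | grep "State:" | sort | uniq -c

# Which PSP (Round Robin / MRU / Fixed) each device is using
esxcli storage nmp device list
```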
By active, do you mean both NICs connected to the one vSwitch are active? Are you using RR or MRU?
That is not the standard way of setting up MPIO for most arrays. You should set one uplink active and the other unused in each iSCSI port group, using only one vSwitch for iSCSI in total.
HP best practice recommends one vSwitch in total, while the ESXi iSCSI configuration guide presents it as an alternative to separate vSwitches.
http://www.vmware.com/pdf/vsphere4/r41/vsp_41_iscsi_san_cfg.pdf (page 38)
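For reference, the active/unused override described above can be set per port group from the CLI as well. A sketch, assuming a single vSwitch with uplinks vmnic1 and vmnic2 and port groups named iSCSI-0 and iSCSI-1 (names are placeholders); with this command, any uplink not listed as active or standby becomes unused:

```shell
# iSCSI-0: vmnic1 active, vmnic2 falls to unused
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name iSCSI-0 --active-uplinks vmnic1

# iSCSI-1: vmnic2 active, vmnic1 falls to unused
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name iSCSI-1 --active-uplinks vmnic2

# Verify the resulting teaming policy
esxcli network vswitch standard portgroup policy failover get --portgroup-name iSCSI-0
esxcli network vswitch standard portgroup policy failover get --portgroup-name iSCSI-1
```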
Yeah, I realise that; hence my comment in the previous post about the iSCSI guide. But these guys were reporting dead paths when none of the network portals were bound directly to the iSCSI initiator in the normal way (vmnic1 - Portgroup1 - iSCSI vmk1, and vmnic2 - Portgroup2 - iSCSI vmk2), as has always been the case since vSphere 4.0. So what I was describing wasn't about setting up MPIO; it was that the software iSCSI initiator should work as it did previously without any port bindings (and it does), and not yield any dead paths (which it doesn't), even when multiple active uplinks are present on the port groups and vSwitches. If you're using a vDS it's unlikely you'll be using separate vSwitches for MPIO; you'd normally just use different port groups with a different NIC set active and all other NICs set to unused. So the problem reported is not really related to the slow boot; it's a configuration problem with the NICs and the iSCSI initiator in that case.
OK, a bit of a misunderstanding perhaps. I was having dead paths when using the "regular" single vSwitch with active+unused adapters, and I also failed to ping the targets. When I switched to separate vSwitches, the dead paths and ping problems went away.
So you were changing the uplinks to active/standby and standby/active on the different port groups on the same vSS? That will still work too, and always has, provided you don't try to do port binding to the iSCSI initiator.
The reason you were getting dead paths is that you had a standby uplink on a port group that was bound to the iSCSI initiator, which, as you found out, doesn't work and is not a supported configuration. As soon as you try to bind that to the iSCSI initiator it will fail. The misconfigured networking then resulted in you not being able to ping the storage. But that isn't a result of the bug causing the slow iSCSI boot; it's the result of an incorrect network and iSCSI initiator configuration.
Also, AFAIK the patch for the slow-boot bug has almost finished testing. I'm hoping to get an early release of it to test in my lab. I'm discussing progress with the developers and engineers on a regular basis, so hopefully it won't be too long before the slow-boot problem, at least, can be put to bed once and for all.
Mistyped: I meant unused, not standby.
OK, so if you had active/unused and unused/active on two port groups on the same standard vSwitch, that should have worked for you when bound to the iSCSI initiator. I've just tested this and it works fine. It should not require two separate standard vSwitches to make it work. The likely cause of something like this is VLANs not being presented to the uplinks correctly.
I've had this exact setup running for ages on 4.0 and 4.1. The issue popped up after installing new 5.0 hosts. Same cabling, same VLANs, same NICs (HP NC382T, Broadcoms) and same SAN...
It was quite an emergency to get it fixed, so the separate vSwitches were the fastest option, and that seems to be stable, therefore I haven't opened a case on it yet. I'll await the next patch and see if it might be fixed.
That's no good at all; there could be another bug. Hopefully the new patch takes care of it. I've had no problems since upgrading other than the slow boot times: the IO is stable and the performance is just fantastic. I'll post here when I know more about when the patch will be out, and I'm sure VMware will post to their blog and KB as well.
Has anyone else tried using SPAN on the pNIC and capturing packets with Wireshark as the host boots?
I saw some very odd behaviour when I did this, and my case now has the network guys looking into it, not just the storage guys!
I can see ARP requests from the vmk on the other port group/dvUplink/pNIC!
Also, switching to separate vSwitches resolved all the storage issues.
I don't think this is storage related either. Something is wrong in the handling of the pNICs when setting them as unused, perhaps combined with binding them to the initiator. I am upgrading one of the last 4.1 hosts to 5.0 tomorrow, so I might do some sniffing if I find the time.
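If you'd rather sniff from the host itself instead of setting up a SPAN port, ESXi 5.0 ships with tcpdump-uw, which can capture directly on a VMkernel interface. A rough sketch (vmk1 and the output path are just examples):

```shell
# Capture everything on the iSCSI VMkernel port into a pcap for Wireshark
tcpdump-uw -i vmk1 -s 1514 -w /tmp/vmk1-boot.pcap

# Or watch ARP live, to see which vmk is sending/answering requests
tcpdump-uw -i vmk1 -e arp
```

Note this only sees traffic on the vmk interface itself, so a switch-side SPAN capture is still useful for confirming which physical port frames actually leave on.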
What teaming method are you using for the port groups? Are you setting them to unused through the GUI or through the command line? How are you doing the upgrade, i.e. a complete rebuild or just an upgrade using VUM? This doesn't sound like expected behaviour, and if you can reproduce it and record the network trace and system logs, it would be worthwhile logging an SR for VMware to investigate further. It'd also be interesting to know what the differences are between your setups and mine, other than just the MD arrays, as I tested using a single vSS, setting the port binding and changing the port group uplinks, and it all worked perfectly as expected.
I have an SR open with Dell and VMware; they will be looking at the traces tomorrow, I hope.
As for my config: all hosts were built from scratch and configured manually (struggling with host profiles and iSCSI!), and all configuration was done in the GUI. The vmks are in port groups with only one dvUplink active, and all I've done to bind the adapters was add them on the binding tab in the GUI, which has no configuration options. The description shows them on the correct pNIC though.
That's all the configuration you should have to do, and it sounds like the correct procedure was followed. So other than the known slow-boot issue it should be working as expected. Good luck with the SRs.
Hopefully there will be a patch soon for this issue.
I use ESXi 5 in a lab environment and often the iSCSI target is powered off. Now I know why the host seems to hang at iscsi_vmk!
