JayDeah
Contributor

Extremely long boot time (crash?) on ESXi 5 host, pause after "iscsi_vmk started"

OK, I've got some Dell PowerEdge 1950 servers connected to an MD3200i that have been happily running ESXi 4.1 U1 for some time. I have put ESXi 5 on them using a number of scenarios, all of which have the same problem. Any thoughts?

The boot process hangs near the end of the progress bar, and the last entry on the screen is "iscsi_vmk started successfully".

If I press Alt+F12, I can see the server hasn't crashed.

I am using vCenter 5.0 with the latest Update Manager and Web Client.

I have used the following scenarios to install and configure:

1) Host upgrade using Update Manager

2) Clean install of ESXi 5, apply old ESXi 4.1 host profile, update settings, reboot

3) Clean install of ESXi 5, clean configuration, reboot

Scenario 1 appeared to leave the host hanging; I gave up after 30 minutes and reinstalled.

Scenario 2 didn't seem to like having the ESXi 4 settings, as I ended up with my software iSCSI adapter on vmhba39. It did eventually start after about half an hour!

Scenario 3 has been hanging for about 15 minutes so far. I'm hoping it finally boots!

95 Replies
BrooklynzFinest
Contributor

The switches should have a VLAN for each iSCSI subnet.

BrooklynzFinest
Contributor

I do agree 8 vmnics is probably overkill. I would drop down to 4 vmnics on each host and continue to use all ports on the controllers. You should also create a VLAN for each iSCSI subnet. This should give you 8 paths to each datastore (4 active, 4 standby).
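
For what it's worth, here is a rough sketch of that path count. It assumes a dual-controller array where each vmnic sits in its own iSCSI subnet and can reach exactly one port on each controller, and that only the paths to the controller owning the LUN are active. The function name and parameters are made up for illustration.

```python
def count_paths(vmnics, controllers=2, ports_reachable_per_controller=1):
    """Estimate iSCSI path counts per datastore.

    Assumption (not from the thread): every vmnic reaches the same number
    of ports on each controller, and only the owning controller's paths
    are active; the rest are standby.
    """
    total = vmnics * controllers * ports_reachable_per_controller
    active = vmnics * ports_reachable_per_controller
    return {"total": total, "active": active, "standby": total - active}


if __name__ == "__main__":
    # 4 vmnics against a dual-controller array
    print(count_paths(4))
```

With 4 vmnics this gives 8 paths in total, 4 active and 4 standby, which matches the numbers above.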

BrooklynzFinest
Contributor

Update: the latest firmware does not resolve the MPIO issue. I also see EqualLogic SANs are having the same issue, based on another thread here. I wonder if other storage vendors are having an issue with MPIO working with ESXi 5.

Catharsis7
Contributor

Is someone from VMware going to address this?  I also am having the same problem.  The only way I can get ESXi 5.0 to reboot in a timely fashion is by rebooting the SAN and completely removing and re-adding it.  We have Dell PowerEdge R610s and a CYBERNETICS miSAN.  It also doesn't look like this is the only thread here that is discussing this problem.

BrooklynzFinest
Contributor

I have found the quickest way to resolve this issue is to go to the CLI of each host and remove all but 1 vmnic from the iSCSI vSwitch.

You can do this with the following command:

esxcfg-vswitch -U vmnic# vSwitch#

You can use the following command to see your vSwitch config:

esxcfg-vswitch -l

Once the vSwitch has only 1 vmnic, the host should regain connectivity, and you can then re-add the vmnics from the vSphere Client. This works for me, and there is no need to reboot the SAN or hosts.
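
If you have several hosts to fix, a small helper like this can print the unlink commands for you to paste into each host's CLI (note the command name is all lowercase on the host). It is only a sketch: the vmnic and vSwitch names are placeholders, and it just builds the command strings, it does not run anything.

```python
def unlink_commands(vswitch, uplinks, keep):
    """Build 'esxcfg-vswitch -U' commands for every uplink except `keep`."""
    if keep not in uplinks:
        raise ValueError(f"{keep} is not an uplink of {vswitch}")
    return [f"esxcfg-vswitch -U {nic} {vswitch}" for nic in uplinks if nic != keep]


if __name__ == "__main__":
    # Hypothetical iSCSI vSwitch with four uplinks; keep only vmnic2.
    uplinks = ["vmnic2", "vmnic3", "vmnic4", "vmnic5"]
    for cmd in unlink_commands("vSwitch1", uplinks, keep="vmnic2"):
        print(cmd)
```

Check the real uplink names with the list command first, since they will differ from host to host.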

JayDeah
Contributor

Look at the times on those entries. Those aren't long timeouts; it takes less than half a second to do that probe.

JayDeah
Contributor

Those of you with problems, are you all using jumbo frames?

One of Dell's suggestions in my support request is to try without jumbo frames enabled on the SAN (I haven't tried it yet).

NicolaB
Contributor

Hi,

I've reconsidered my whole iSCSI setup based on your suggestions. Here's the final config:

[Attached image: 1837334.png]

RED lines belong to one VLAN and BLUE lines to another.

I've just rebooted one host and it still takes a long time. I'm using jumbo frames.

JayDeah
Contributor

You have multiple layer 3 subnets on the same VLANs on the same switches.

Whilst this will still work (as it's L2/L3), it's not best practice.

Also, you have the same subnets split across multiple switches. I'm guessing you're using trunking between the 2 switches? If not, it would never work anyway, as you would have no route between vmk and SAN on half your NICs!

Ditch this idea. Remember, if a switch fails you SHOULD lose 50% of your paths; that is a valid scenario. Do not try to work in L3 resilience at the physical switch level.

NicolaB
Contributor

JayDeah wrote:

You have multiple layer 3 subnets on the same VLANs on the same switches.

Whilst this will still work (as it's L2/L3), it's not best practice.

Also, you have the same subnets split across multiple switches. I'm guessing you're using trunking between the 2 switches? If not, it would never work anyway, as you would have no route between vmk and SAN on half your NICs!

Ditch this idea. Remember, if a switch fails you SHOULD lose 50% of your paths; that is a valid scenario. Do not try to work in L3 resilience at the physical switch level.

Excuse me, where do you see that?

Switch A: VLAN blue = subnet 192.168.1.0/24, VLAN red = subnet 192.168.2.0/24

Switch B: VLAN blue = subnet 192.168.3.0/24, VLAN red = subnet 192.168.4.0/24
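
A quick way to double-check a layout like this is to confirm that no two of the four subnets overlap. The sketch below uses Python's standard ipaddress module with the subnets listed above; the dictionary keys are just labels for illustration, and the helper name is made up.

```python
import ipaddress
from itertools import combinations

# Subnets as listed in this post; the keys are only descriptive labels.
LAYOUT = {
    "switch A / VLAN blue": "192.168.1.0/24",
    "switch A / VLAN red": "192.168.2.0/24",
    "switch B / VLAN blue": "192.168.3.0/24",
    "switch B / VLAN red": "192.168.4.0/24",
}


def overlapping_subnets(layout):
    """Return every pair of labels whose subnets share addresses."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in layout.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


if __name__ == "__main__":
    clashes = overlapping_subnets(LAYOUT)
    print(clashes if clashes else "no overlapping subnets")
```

An empty result means each VLAN carries its own subnet, which is the point being made here.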

JayDeah
Contributor

So you have 4 separate VLANs then, all with unique numbers, and no trunk between the switches?

NicolaB
Contributor

JayDeah wrote:

So you have 4 separate VLANs then, all with unique numbers, and no trunk between the switches?

Right.

Catharsis7
Contributor

I have brought the iSCSI down to a single path with 1500 MTU set on the SAN and on the Dell server. It still takes forever to boot, and it still happens at the "iscsi_vmk loaded successfully" screen. 1 NIC, 1 LUN, 1 datastore, 1 IP on each side.

NicolaB
Contributor

Catharsis7 wrote:

I have brought the iSCSI down to a single path with 1500 MTU. It still takes forever to boot, and it still happens at the "iscsi_vmk loaded successfully" screen. 1 NIC, 1 LUN, 1 datastore, 1 IP on each side.

In my case it seems to take a long time after "iscsi_mask_path loaded successfully".

Catharsis7
Contributor

Nicola Bressan wrote:

In my case it seems to take a long time after "iscsi_mask_path loaded successfully".

In either case, this is clearly a problem that VMware needs to address.  If I remove the iSCSI connection, ESXi will boot fast.  As soon as there is any hint of a very basic iSCSI data connection, reboot times are absurd.  I do not recall if this was happening with 4.1, but I don't recall ever rebooting.  (We only had it set up for about a month with very minimal testing.  These servers are not yet in our production environment.)

buckmaster
Enthusiast

Having the same issue in my home lab connected to an Iomega StorCenter ix4-200d NAS. I did NOT have this issue with 4.1. I also had this issue with the beta code of 5, and that posting was not resolved.

Thanks for the post.  Maybe VMware will provide a fix.

Tom Miller

ashleyw
Contributor

Hi, I get the same problems connecting vSphere 5 to a NexentaStor iSCSI target. We have tried all combinations of VAAI on/off, jumbo frames on/off, presenting the storage as one or multiple targets, removing multipathing, trying different target portal group combinations, and so on. None of it makes any difference: we still see significant delays at startup, and in our case boot can take up to 45 minutes. After the long boot, everything appears to work the way it should. We see the issue on nested ESXi 5 instances, on UCS blades, and on DL360 G7s and G6s. Could someone from VMware please acknowledge these issues and give some statement of intent to fix them? There seem to be enough people experiencing the exact same issue on a variety of equipment to indicate this is an issue specific to vSphere 5 in iSCSI shops. Please PM me if anyone from VMware needs the megaupload URL for the full support logs from one of the hosts in our environment.

MichaelW007
Enthusiast

Hi Guys,

Just wanted to let you know that VMware is well aware of this problem and the bug is logged as an extremely high priority (read critical) and a fix will be available as soon as possible. There are limited workarounds at this stage other than limiting the targets to those that can be connected to by all bound initiator ports. Believe me they are taking this very seriously and have had many support requests logged due to this issue.

This behaviour is new in ESXi 5.0 due to the way the iSCSI target login and discovery code performs multiple retries per combination of target and initiator. This can cause very lengthy delays during boot time due to the exponential retries that are caused as initiator and target numbers increase. In 4.1 there was no retry during discovery or login during boot, hence no delay in the boot times.
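
To get a feel for how quickly this adds up, here is a back-of-the-envelope sketch. It assumes the worst case where every initiator/target combination fails, every retry runs to a full timeout, and attempts happen one after another. The retry count and timeout used below are placeholders for illustration, not the actual values in the ESXi 5.0 code, and the function names are made up.

```python
def login_attempts(initiator_ports, targets, retries):
    """Attempts made if every initiator/target combination is retried."""
    return initiator_ports * targets * retries


def worst_case_delay_seconds(initiator_ports, targets, retries, timeout_s):
    """Total wait if every attempt times out and attempts run sequentially."""
    return login_attempts(initiator_ports, targets, retries) * timeout_s


if __name__ == "__main__":
    # Placeholder values: 5 retries per combination, 10 s per timed-out attempt.
    for ports, targets in [(2, 4), (4, 8)]:
        delay = worst_case_delay_seconds(ports, targets, retries=5, timeout_s=10)
        print(f"{ports} ports x {targets} targets: {delay / 60:.1f} minutes")
```

In this simple model the delay grows with the product of initiator ports and targets, so doubling both quadruples the wait.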

I would encourage you to log a support request if you have a current support agreement to ensure that you receive the patch and notifications as soon as they are available. I'll also post back to this as soon as I'm aware the patch is available.
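
For the workaround mentioned in this post (only keeping targets that every bound initiator port can connect to), a sketch like the following can help pick the targets. It assumes a flat layer 2 setup with no routing, so a port "can connect" only when the target address is inside the port's own subnet. All addresses and function names are made-up examples.

```python
import ipaddress


def reachable(vmk_iface, target_ip):
    """True if target_ip lies inside the vmk interface's own subnet."""
    network = ipaddress.ip_interface(vmk_iface).network
    return ipaddress.ip_address(target_ip) in network


def targets_reachable_by_all(vmk_ifaces, target_ips):
    """Keep only targets that every bound vmk port can reach directly."""
    return [t for t in target_ips if all(reachable(v, t) for v in vmk_ifaces)]


if __name__ == "__main__":
    # Hypothetical bound ports and target portals.
    vmks = ["192.168.1.11/24", "192.168.1.12/24"]
    portals = ["192.168.1.100", "192.168.2.100"]
    print(targets_reachable_by_all(vmks, portals))
```

Any target dropped from the list is one that at least one bound port would keep retrying against during boot.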

flix21
Contributor

Same issue here - Dell R710 servers, Dell MD 3220i SAN, ESXi v5.  I'm contacting Dell support to see what their stance is.

flix21
Contributor

Here is an update from VMware that is related to this issue:

http://blogs.vmware.com/vsphere/2011/10/slow-booting-of-esxi-50-when-iscsi-is-configured.html

Here is the associated KB article:

http://kb.vmware.com/kb/2007108
