moberle
Contributor

ESX install from PXE boot No clear answers Please help

Could someone please help with a DHCP/PXE configuration issue? I have the same problem with the PXE client not receiving the DHCP Offer from the DHCP server. We ran a sniffer on the network and determined that the DHCP server receives the request and replies with an Offer, but the packets never get back to the PXE client.

I am trying to install ESX 3.0.1 using a PXE boot to a MS DHCP/PXE server. The ESX server (Dell 1955 blade) is connected to the network via a switch port on a 6509. The DHCP/PXE server is a Windows 2003 server running in a VM on another ESX host on the same subnet and vlan (Native) as the PXE client.

The cisco port configuration of the ESX server containing the DHCP server VM is as follows:

interface GigabitEthernet3/39
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 322
 switchport trunk allowed vlan 2-4094
 switchport mode trunk
 switchport nonegotiate
 no ip address
 no cdp enable
 spanning-tree portfast
 spanning-tree bpduguard enable
 spanning-tree guard root
end

The cisco port configuration of PXE client machine is as follows:

interface GigabitEthernet3/42
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 322
 switchport trunk allowed vlan 2-4094
 switchport mode trunk
 switchport nonegotiate
 no ip address
 no cdp enable
 spanning-tree portfast
 spanning-tree bpduguard enable
 spanning-tree guard root
end

We have the virtual switch on the ESX host that runs the DHCP VM set up with a pxe vlan defined as 322 (the vlan defined as native) to allow the virtual switch to see the native vlan, and a mgmt vlan (vlan 310) defined to allow tagged vlan traffic to go through to our subnet.

We have 2 vNICs defined in the DHCP VM one using the pxe vlan and one using the mgmt vlan to allow RDP and other network type connections to work across the tagged vlans.

I have tried disconnecting the vNIC connected to the mgmt vlan to make sure there was no conflict there. (Both NICs have IP addresses on the same subnet. I know, I know. If someone can tell me how to add a second vlan to the one vNIC, I would appreciate that as well.) DHCP is set up to service only the vNIC using the pxe vlan. With this setup the RDP traffic fails, and the DHCP Offer still goes out but is not received by the PXE client.

If anyone can point me to documentation with the complete configuration needed to set up the Cisco ports, the vSwitches, and the vNICs, or can help me with this directly, I would greatly appreciate it.

virtech
Expert

A few people have reported problems like this. Sounds like a bug.

See this link

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=189795

virtech
Expert

As a test, are you able to put a hub between your ESX server and the 6509 and retry the build?

dinny
Expert

Hiya,

I saw your posts on a couple of other threads.

It sounds like your problem is happening well before the second dhcp request in the kickstart referenced in the bug check above?

Is that right?

I'm running ESX servers in two sites - one with Cisco switches running IOS and one with Cisco switches running CatOS - not sure which hardware offhand.

I am in the irritating position where the on-site IOS ones work OK and the remote-site CatOS ones fail - almost certainly with the problem in the bug report (thanks Mooihoek) - whereby I get to the PXE server fine - pick a build option - then kickstart fails to get an IP address and bombs out with "unable to find the relevant .img file".

Are you using the UDA app (where one of your other posts was) - or are you actually running a PXE server on your DHCP server?

I am using UDA, via a MS DHCP server with scope options 66 and 67 set to locate the UDA server.

I did initially have a problem accessing the UDA server as I was only using tagged packets and the PXE agent didn't recognise them. When I added the native vlan it worked OK.

However I have not tried to actually bind the Native VLAN to an ESX portgroup.

There is a whitepaper on it - not sure if you've seen it:

http://www.vmware.com/pdf/esx3_vlan_wp.pdf

There are a couple of commands mentioned at the bottom which might make a difference?

"vlan dot1q tag native" etc...

If you can it might be worth temporarily setting up a standalone server to host DHCP/PXE - then you will be able to pinpoint your problem a little better.

If that works OK - then it would point to issues with passing the native vlan on to the ESX port group - if that fails - then it is presumably more to do with your actual Cisco environment?

Dinny

moberle
Contributor

Sorry for posting more than once. I realized I had posted on a thread that had already been answered, and then realized I was somehow logged in under someone else's ID (Darrius). I don't know how that happened.

Anyway, I have an MS DHCP server running as a VM on an ESX host, and it is also the PXE server. I plan on using the MS deployment server and BDD services to provision VMs, as well as for provisioning and patching my ESX hosts.

The PXE boot works inside the vSwitch. I found that out today. I created a VM on the same ESX host where the DHCP server resides and let it PXE boot. It found the server, and PXELinux booted the PXE menu I had set up. Yes, the problem occurs before the kickstart bug. The PXE client never receives the Offer from the DHCP server, as my first post stated. (This was proven with a network sniff from both the Cisco and the vSwitch sides.) The DHCP Offer broadcast just disappears somewhere between the DHCP server's vNIC and the physical Cisco switch port. I believe it is because the vSwitch doesn't know that the native VLAN 322 traffic is supposed to be untagged and is tagging it. My knowledge of VLAN tagging is limited, but this appears to be the issue. At first I thought the ESX host firewall might be blocking the packets, but then I realized that the firewall only looks at packets destined for the ESX server itself.

I have read the white paper that you mentioned. The problem is that I could not get the DHCP server to even see the DHCP request before I set up the native VLAN. Now at least the request gets processed by the DHCP server. As for tagging native VLAN packets, wouldn't that just cause the same problem I had in the first place?

Dinny, could you send me a sample of the Cisco port configs that you are using? I don't believe this is a complex problem, just one whose solution is elusive. You mention that you had the problem with PXE and tagged packets. I set up the untagged native VLAN and bound it to the vSwitch to try to fix the problem. Since you have had success at least to that point, I think your configuration could help me.

Thanks for your response. If anyone else has anything that might help, I am open to anything you can come up with.

I also plan on trying a standalone DHCP server. But if you could send me the configs before I start adding services to another server, I could at least rule out that issue.

VMELH
Contributor

Hello,

I also PXEboot/install VMs and hosts from my systems. I have a standalone DHCP server, well it is more than DHCP but it is external to the ESX server.

I would try this combination without Vlan Tagging and see if that solves the problem. If this works then I would introduce one change at a time to see where the break is. It could be the physical switch vlan or even vlan tagging in the vSwitch. I think this may be your only way of telling if it is one or the other.

I would also check the DHCP Server for DHCP related messages, it could be that the DHCP server is not responding correctly.

Just a thought. We considered having the DHCP server inside ESX, but dismissed that as everything at the site uses DHCP except for PDUs and the ESX servers themselves.

Best regards,

Edward

moberle
Contributor

This is going to be my next course of action. I already tested DHCP within the vSwitch on the same ESX host: I created another VM on the same ESX host as the DHCP server and booted it via PXE with a test PXELinux loader.

I plan on setting up the second adapter on the Vcenter server within the same native VLAN and installing DHCP there. This will probably be our permanent solution for DHCP in both of our ESX subnets. (We have half of each blade chassis set up as internal and half serving our DMZ VMs.) This solution is intended to provision both. We will probably be adding 5 more chassis in the next 2 years, so an automated provisioning solution is needed to keep up with the growing environment.

Our environment is the opposite: I am using DHCP to provision the servers only. The only other DHCP in our environment is in the user subnet where the developers' workstations reside. This is a very "server"-heavy environment, in that we have more than 3 times as many servers as workstations.

dinny
Expert

Hiya,

Not sure if it will help much - as I don't ever use the Native VLAN zz on the VM portgroups at all - but here is how the (working) Cisco IOS port is set up.

I use this port to PXE boot via the native VLAN zz to actually build the ESX server, once the ESX server is built I use it as my primary SC connection on vlan xx, and my backup vmotion connection on vlan yy.

i.e. I have two port groups set up on vSwitch0 - one on vlan xx for the SC and one on vlan yy for vmotion.

The zz native vlan is purely used by the PXE agent on the NIC at the initial ESX server build time. After build it is not used at all - ESX knows nothing about it.

interface GigabitEthernet pp/qq
 description blah de blah.....
 no ip address
 no snmp trap link-status
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan zz
 switchport trunk allowed vlan xx,zz,yy
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk

Dinny

moberle
Contributor

TY Dinny

I have one question though. If your ESX host (the one that the UD toaster is on) doesn't have the native vlan defined in a vSwitch, how is the UD toaster (sorry, bad humour) VM seeing the native vlan?

Maybe this is the part that I'm missing. I assumed that the vSwitch would drop the untagged packets since it wouldn't know what vlan to send them out on, because by default vSwitches don't participate in vlan 0. I took that to mean they would not send untagged packets at all.

Am I incorrect in my understanding of that?

Sorry my ESX experience is limited so am flying by the seat of my pants here.

dinny
Expert

Hiya,

My UDA app is a VM - but that is on a completely different vlan to the native vlan that I PXE boot on, and is different again to the vlan that the SC is on.

When the PXE agent on the ESX server NIC tries to PXE boot on vlan zz - the Cisco VLAN is set up to have IP forwarders to my DHCP servers - let's say the DHCP servers are on vlan ss.

I then have a DHCP scope set up (on my DHCP servers sitting on vlan ss) for vlan zz (that I am PXE booting on).

I then use the DHCP scope option 66 to give the IP address of my UDA app.

Let's say the UDA app is on VLAN tt.

(Which happens to be a portgroup for my VMs on my ESX server - but it could just as well be a standalone server)

The process is then as follows:

The unbuilt ESX server boots on native vlan zz

An IP forwarder statement on the Cisco passes the DHCP request from vlan zz to one of my (standalone) DHCP servers on vlan ss

One of the DHCP servers replies with an IP address on vlan zz

Once a DHCP lease has been negotiated, the DHCP scope option 66 tells my unbuilt ESX server to go to my UDA app (on vlan tt) to get its PXE info.

Option 67 in the dhcp scope tells it what PXE file to use.

The build then commences, using a DHCP address on vlan zz, and getting the relevant kickstart data from the UDA server on vlan tt, using http or NFS, in the usual manner.

The irritating part is that when the ESX server is built I actually want the service console to be on VLAN xx and vmotion yy.

So I then have to script setting up the relevant port groups and IP addresses in the rc.local using esxcfg and vimsh commands.

Finally I reboot - and I'm happily using the ESX server on vlan xx for the SC and vlan yy for vmotion.

I only ever use the native vlan zz on that cisco port again, if I ever need to rebuild my ESX server.

Hope that makes sense?

Must be worth a few points if it does :)

Dinny

VMELH
Contributor

Hello,

You are using DHCP only during the provisioning of an ESX Host and not during the running of an ESX host? If the first is the case, we do the same, all our ESX Servers are provisioned using DHCP and then we finally give them a static address. We then download and run a configuration script.

If it is the second, where you are using DHCP for an ESX server service console, that is not necessarily a great idea. Every time I have tried to do that I have ended up with very interesting DHCP server logfiles; specifically, the MAC address of the ESX Server never stayed the same twice.

How is your DHCP server configured? Only to allow specific MAC addresses, or a range of open ports?

Best regards,

Edward

meistermn
Expert

Did you install ESX via PXE with the SAN connected?

http://www.vmware.com/community/thread.jspa?threadID=86894&tstart=10

moberle
Contributor

OK, the picture is much clearer.

The answer is the DHCP/BOOTP forwarder. The forwarder is acting like a VLAN traffic cop. It takes the broadcast traffic and points it to a specific DHCP server on another VLAN. It then catches the reply from that server on that VLAN and rebroadcasts it on the native VLAN to the PXE client. Am I correct in this interpretation?

My DHCP server is set up very similarly to what you describe here, the difference being that my DHCP server is also my PXE server. Option 66 (Boot Server Host Name) points to its own IP address. This config works within the vSwitch environment on the host that it resides on. The VMs can PXE boot to it and receive the boot loader file from the DHCP/PXE server, so I know that part is working.

You wrote: "When the PXE agent on the ESX server NIC tries to PXE boot on vlan zz - the Cisco VLAN is set up to have IP forwarders to my DHCP servers"

I assume you are referring to an "ip helper-address" on the interface of your native VLAN. That seems to be the most likely place to put the forwarder. But the VLAN interface doesn't have an IP address to respond to the request, so the DHCP server has no way to respond.

If it is possible, can you and I talk via IM or take this private? I know I'm asking a lot here. My email address is moberle@NRTwebservices.com.

You've been more than accommodating. TYVM. I appreciate the help you've already given me. And yes, bunches of points will be awarded!!! LOL

moberle
Contributor

I am building the ESX server using DHCP only. It will have its own static address after the build is complete.

dinny
Expert

Hi Moberle,

Dropped you an email a few hours ago...

Did it arrive?

Off home now....

Dinny

moberle
Contributor

Yes, it did. I had already left work.

I have sent a reply.

Regards,

Michael

moberle
Contributor

Dinny

Hope things are good.

I am posting this to the forum as well.

I have finally had time to get back to this.

Have had partial success.

Set VLAN 322 as a Layer 3 VLAN with an IP address and IP helper addresses. (This of course is complicated by the fact that we have two 6509s working as redundant routers, with GLBP set and working, and we have one NIC on each server attached to each router.)

On Router 0:

interface Vlan322
 ip address 10.234.20.2 255.255.255.240
 ip helper-address 10.158.121.94
 glbp 5 ip 10.234.20.1
 glbp 5 load-balancing host-dependent

On Router 1:

interface Vlan322
 ip address 10.234.20.3 255.255.255.240
 ip helper-address 10.158.121.94
 glbp 5 ip 10.234.20.1
 glbp 5 load-balancing host-dependent

Removed "no ip forward-protocol udp tftp" to enable TFTP forwarding.

Added "ip forward-protocol udp bootps" to enable DHCP forwarding.

Added "ip forward-protocol udp bootpc" to enable PXE boot forwarding.

PXE Interface Cisco Port

interface GigabitEthernet3/42
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 2-4094
 switchport mode trunk
 switchport nonegotiate
 no ip address
 no cdp enable
 spanning-tree portfast trunk
 spanning-tree bpduguard enable
 spanning-tree guard root

ESX Host Port that the MS DHCP VM resides on.

interface GigabitEthernet3/39
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 322
 switchport trunk allowed vlan 2-4094
 switchport mode trunk
 switchport nonegotiate
 no ip address
 no cdp enable
 spanning-tree portfast
 spanning-tree bpduguard enable
 spanning-tree guard root

And voilà.

The NIC is Broadcasting a DHCP Discover Packet.

The VLAN Interface (IP Helper) is relaying DHCP Discover to the DHCP server.

The DHCP Server is responding with a DHCP OFFER back to the IP Helper.

The IP Helper is broadcasting the OFFER back to the NIC.

The NIC is seeing the OFFER and responding with a DHCP REQUEST.

The IP Helper is Relaying the DHCP REQUEST to the DHCP Server.

The DHCP server is responding with a DHCPACK to the IP helper.

The IP Helper is broadcasting the DHCPACK to the NIC.

And we now have an IP address and BOOTP with PXE options on the NIC.
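The relay step in the exchange above can be sketched in a few lines of Python. This is only an illustration of what a BOOTP/DHCP relay agent does to a client packet before forwarding it (fill in the gateway address field, giaddr, if it is empty, and bump the hop count); it is not the Cisco implementation. The function names and the use of 10.234.20.2 as the relay address are assumptions made for the example.

```python
import socket
import struct

GIADDR_OFFSET = 24  # BOOTP fixed header: giaddr occupies bytes 24..27
HOPS_OFFSET = 3     # hop count is the fourth byte of the header


def blank_discover(xid: int) -> bytes:
    """Build a minimal 236-byte BOOTP header (hypothetical helper).

    op=1 (BOOTREQUEST), htype=1 (Ethernet), hlen=6, hops=0,
    broadcast flag set; every other field is left zeroed.
    """
    header = struct.pack("!BBBBIHH", 1, 1, 6, 0, xid, 0, 0x8000)
    return header + bytes(236 - len(header))


def relay_to_server(packet: bytes, relay_ip: str) -> bytes:
    """Return a copy of a client packet as a relay agent would forward it."""
    buf = bytearray(packet)
    # Only the first relay on the path stamps giaddr; later ones leave it.
    if bytes(buf[GIADDR_OFFSET:GIADDR_OFFSET + 4]) == b"\x00\x00\x00\x00":
        buf[GIADDR_OFFSET:GIADDR_OFFSET + 4] = socket.inet_aton(relay_ip)
    buf[HOPS_OFFSET] += 1
    return bytes(buf)


if __name__ == "__main__":
    forwarded = relay_to_server(blank_discover(0x12345678), "10.234.20.2")
    print(socket.inet_ntoa(forwarded[GIADDR_OFFSET:GIADDR_OFFSET + 4]))
```

The DHCP server uses giaddr both to pick the scope and as the address to send its reply to, which is why the VLAN interface needs an IP address before the helper can work.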

And all is well. Well, almost.

The PXE boot now attempts to TFTP the PXE boot file (pxelinux.0) from the PXE server. (This works when another VM PXE boots.)

It broadcasts an ARP request for the MAC address of the IP helper. (Not sure why this is happening.)

The IP helper responds with an ARP reply, but the reply is never seen, resulting in a PXE-E11 ARP timeout error.

I have done some Googling on this and found two possible causes that seem to apply:

DHCP option 60 on an MS DHCP server with a DHCP proxy running. (Applies here. I removed the option, with no effect.)

Cisco routers not forwarding ARP requests. (The request is received and replied to, and both interfaces are in the same subnet and VLAN, so this doesn't seem to be the issue.)

Does anyone have any guidance on this? Points will be awarded to all (of course, someone will have to tell me how to do that too, LOL).

dinny
Expert

Hi Michael,

You no longer seem to have the native VLAN statement set on port 3/42?

It was on in your first post?

Mind you I don't see how all the other bits that now work would work without that?

What is option 60 on MS DHCP?

My DHCP server doesn't seem to recognise that?

To award points.

Click on "helpful" when you reply to award six points

You can do this twice per thread.

Click on "correct" if it is answered to award 10 points

Dinny

moberle
Contributor

OK, here's the deal: I finally figured it out, thanks to a lot of help from you guys.

The only way that I got that far was to remove the native VLAN from that interface. The reason it worked (and the reason it didn't) is that the 1955 uses Broadcom NetXtreme II network cards, which allow VLAN processing to be set during PXE boot. I had set the card to PXE boot using VLAN 322 in the PXE settings. This made the DHCP part work, but the ARP broadcast failed because it didn't understand the tagged frame.

To fix this I disabled the VLAN setting in the PXE settings on the card and re-added the native VLAN setting to the interface. And damn if it didn't work.
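For anyone checking a capture for the same symptom, here is a minimal Python sketch of how to tell a tagged frame from an untagged one. It assumes you have the raw Ethernet frame bytes from a sniffer; the function name is made up for the example. An 802.1Q tag shows up as the value 0x8100 where the EtherType normally sits, followed by two bytes whose low 12 bits are the VLAN ID.

```python
import struct

TPID_8021Q = 0x8100  # value in the EtherType position that marks an 802.1Q tag


def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 0..11 are the destination and source MACs; 12..13 follow them.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != TPID_8021Q:
        return None  # untagged: this is how native VLAN traffic looks on a trunk
    if len(frame) < 16:
        raise ValueError("truncated 802.1Q tag")
    (tci,) = struct.unpack("!H", frame[14:16])
    return tci & 0x0FFF  # drop the priority and DEI bits, keep the VLAN ID
```

In a setup like the one described, frames on the native VLAN should come back as None on the trunk, and anything the NIC sends with its PXE VLAN setting enabled should come back as 322.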

DHCP option 60 is used to designate the DHCP server as the PXE server in an environment that doesn't use DHCP relays. It has to be added using the NETSH command; MS doesn't officially support this option. It did work before I added the Layer 3 VLAN IP helpers.

Now to actually get ESX to install from the network!!!!!

Thank you all for your help and suggestions.

and especially to Dinny for his patience.

Here is a synopsis of how to get PXE to work in an MS environment running on a VM.

Set VLAN xxx (in my case we used VLAN 322 for this purpose) as a Layer 3 VLAN with an IP address and IP helper addresses. (This of course is complicated by the fact that we have two 6509s working as redundant routers, with GLBP set and working, and we have one NIC on each server attached to each router.)

On Router 0:

interface Vlan322
 ip address 10.234.20.2 255.255.255.240
 ip helper-address 10.158.121.94
 glbp 5 ip 10.234.20.1
 glbp 5 load-balancing host-dependent

On Router 1:

interface Vlan322
 ip address 10.234.20.3 255.255.255.240
 ip helper-address 10.158.121.94
 glbp 5 ip 10.234.20.1
 glbp 5 load-balancing host-dependent

Removed "no ip forward-protocol udp tftp" to enable TFTP forwarding.

Added "ip forward-protocol udp bootps" to enable DHCP forwarding.
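As a quick sanity check on the addressing above, the following Python sketch uses the standard ipaddress module to work out which addresses fall inside the Vlan322 subnet. The addresses are the ones from the config; the helper function name is just for illustration.

```python
import ipaddress


def same_subnet(interface: str, other: str) -> bool:
    """True if `other` lies in the subnet of `interface` (address/netmask)."""
    network = ipaddress.ip_interface(interface).network
    return ipaddress.ip_address(other) in network


VLAN322_ROUTER0 = "10.234.20.2/255.255.255.240"

if __name__ == "__main__":
    print(ipaddress.ip_interface(VLAN322_ROUTER0).network)  # 10.234.20.0/28
    print(same_subnet(VLAN322_ROUTER0, "10.234.20.1"))      # GLBP virtual IP: True
    print(same_subnet(VLAN322_ROUTER0, "10.234.20.3"))      # Router 1: True
    print(same_subnet(VLAN322_ROUTER0, "10.158.121.94"))    # helper target: False
```

The helper target sits outside the 10.234.20.0/28 range, so the DHCP traffic has to be relayed by the VLAN interface rather than reaching the server as a plain broadcast.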

PXE Interface Cisco Port

interface GigabitEthernet3/42
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 322
 switchport trunk allowed vlan 2-4094
 switchport mode trunk
 switchport nonegotiate
 no ip address
 no cdp enable
 spanning-tree portfast trunk
 spanning-tree bpduguard enable
 spanning-tree guard root

Cisco Port of the ESX Host that the MS DHCP VM resides on.

interface GigabitEthernet3/39
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 322
 switchport trunk allowed vlan 2-4094
 switchport mode trunk
 switchport nonegotiate
 no ip address
 no cdp enable
 spanning-tree portfast
 spanning-tree bpduguard enable
 spanning-tree guard root

Set DHCP options 66 and 67 on the DHCP server.

Option 66 is the TFTP server name. (Use the IP address; DNS is not supported.)

Option 67 is the PXE boot file name, in my case PXELinux.0.

Install MS WDS services on the DHCP server or another server. (The IP address must correspond to the IP given in option 66 above.)

Copy the PXELinux.0 file to the root of the TFTP folder and create the PXELinux.cfg folder. (This is documented on the PXELinux website.)

Alternatively, you can change the TFTP root folder with a registry edit. (I found this documented on the PXELinux website as well.)
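To make options 66 and 67 a little more concrete, here is a rough Python sketch of how they are laid out in the DHCP reply: each option is a code byte, a length byte, and then the value. The function names are invented for the example, and using 10.158.121.94 as the TFTP server is an assumption based on the DHCP server and PXE server being the same machine in this setup.

```python
def encode_option(code: int, value: bytes) -> bytes:
    """Encode one DHCP option as code, length, value."""
    if not 0 < code < 255:
        raise ValueError("codes 0 and 255 are pad/end markers, not TLV options")
    if len(value) > 255:
        raise ValueError("option value does not fit in a one-byte length")
    return bytes([code, len(value)]) + value


def pxe_boot_options(tftp_server: str, boot_file: str) -> bytes:
    """Options 66 (TFTP server name) and 67 (boot file name), back to back."""
    return (encode_option(66, tftp_server.encode("ascii"))
            + encode_option(67, boot_file.encode("ascii")))


if __name__ == "__main__":
    # Assumed values: the helper address from the config and the PXELinux loader.
    print(pxe_boot_options("10.158.121.94", "pxelinux.0").hex(" "))
```

On an MS DHCP server these are set as scope options in the console, so nothing like this needs to be written by hand; the sketch only shows what the PXE client ends up parsing.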

I have awarded points to all who helped. TY again.
