VMware Communities
patpat
Contributor

PXE "firmware" BUG

I think I have discovered a bug in VMware's PXE "firmware" layer.

The current test scenario is:

Host OS:    Windows 8 Pro

VMware:     Workstation 9

PXE server: Serva


When PXE-installing guest OSs like Vista, 7, or 8, the TFTP server transfers a Boot.wim of about 200 MB.
That transfer presents a consistent TFTP error pattern that can be seen in Serva's log, like:

run#1

[10/26 16:15:38.723] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/26 16:15:43.553] TFTP Err: timeout waiting for ack blk 16#24032     #24032

[10/26 16:15:49.675] TFTP Err: timeout waiting for ack blk 16#56792     #56792

[10/26 16:15:55.696] TFTP Err: timeout waiting for ack blk 16#24016     #89552

[10/26 16:16:01.583] TFTP Err: timeout waiting for ack blk 16#56776     #122312

[10/26 16:16:06.391] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222629455 bytes in 28s, err recovery=4

run#2

[10/26 16:16:42.533] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/26 16:16:48.524] TFTP Err: timeout waiting for ack blk 16#32112     #32112

[10/26 16:16:54.341] TFTP Err: timeout waiting for ack blk 16#64872     #64872

[10/26 16:17:00.202] TFTP Err: timeout waiting for ack blk 16#32096     #97632

[10/26 16:17:06.009] TFTP Err: timeout waiting for ack blk 16#64856     #130392

[10/26 16:17:09.371] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222629455 bytes in 27s, err recovery=4

run#3

[10/26 16:17:59.439] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/26 16:18:01.620] TFTP Err: timeout waiting for ack blk 16#7424       #7424

[10/26 16:18:07.492] TFTP Err: timeout waiting for ack blk 16#40184     #40184

[10/26 16:18:13.389] TFTP Err: timeout waiting for ack blk 16#7408       #72944

[10/26 16:18:19.262] TFTP Err: timeout waiting for ack blk 16#40168     #105704

[10/26 16:18:25.374] TFTP Err: timeout waiting for ack blk 16#7392       #138464

[10/26 16:18:27.630] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222641103 bytes in 28s, err recovery=5

Serva TFTP windowsize = 8 for the tests above (EDIT: was mistakenly 16 in the original post)

The error pattern is as follows:

There's an initial error on a random block # < 32768 (EDIT: the original post mistakenly said 65535); next, the error repeats periodically, in this case every 32760 blocks.
The period changes slightly if the windowsize is changed.
The error consists of the client not sending the corresponding ACK for a perfectly sent block (confirmed with Wireshark).

The TFTP server recovers from the errors and the file finally gets transferred, with some delay.

The error pattern is only present when the net traffic uses virtual NICs exclusively,
i.e. between Host and Workstation guests running on the same hardware.
If the TFTP traffic "touches" a real Ethernet wire, the error is not seen.

I have used other PXE servers (3Com) and the pattern is also there.

I have used a different host OS (Vista) and the pattern is also there.

I have used old Workstation versions (6 and 8) and the pattern is also there.

I have received reports from people using ESX who experience the same problem.

I have tested changing the TFTP transfer inter-packet-delay and the pattern is not affected.

Originally the TFTP protocol only allowed ACKs for block numbers up to 32767, but that limit is not enforced nowadays.

I think you guys have some uninitialized (random-value) short counter variable (2 bytes) that triggers the first error when it changes sign, and then triggers it again periodically each time the counter changes sign again.
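
Patrick's hypothesis can be sketched in a few lines. This is purely illustrative Python, not VMware code: it shows how a counter reinterpreted as a signed 16-bit integer, starting from a random value, changes sign once at a random point below 32768 and then every 32768 increments thereafter.

```python
import ctypes

def sign(v: int) -> int:
    """Sign of v when reinterpreted as a signed 16-bit integer."""
    return 1 if ctypes.c_int16(v).value >= 0 else -1

start = 12345          # stand-in for the uninitialized (random) value
flips = []
prev = sign(start)
for i in range(1, 100_000):
    cur = sign(start + i)
    if cur != prev:
        flips.append(i)
        prev = cur

print(flips[0])             # 20423 = 32768 - 12345: depends on the random start
print(flips[1] - flips[0])  # 32768: fixed period thereafter
```

Note that this simple model predicts a period of exactly 32768, whereas the measured period later in the thread turns out to be 32768 − windowsize, so the model is only approximate.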

If you need more info let me know.

Patrick

14 Replies
dariusd
VMware Employee

Hi Patrick!

Those are some pretty fine observations you have made there... very diligent and methodical 😎.  Lots of juicy information to help eliminate a few possible causes.  Let me see if I can continue...

The initial error seems to also always be on a block < 32768.

If you vary windowSize, does the interval of the timeout happen to be 32768 - windowSize / 2 blocks?  That would be super interesting.

Other than your "real Ethernet wire" tests, are you always running the TFTP server (Serva or 3com) on the same (Vista or Win8Pro) host machine as you're running Workstation (6/8/9) and the virtual machines?

The most interesting thing about the logs is that there is another pattern across the three runs, looking at the packet number of the first timeout.  32112-24032 is 8080.  7424-32112+32768 (for a 15-bit wraparound) is also 8080!  I'd be very interested to see if that pattern held up to further observation and scrutiny.  If so, it strongly suggests that the fault involves state which exists or persists outside the virtual machine, not an uninitialized variable.
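
The arithmetic on the first-timeout block numbers is easy to double-check (illustrative Python over the numbers quoted from runs 1-3):

```python
# First-error block numbers taken from runs 1-3 above.
first_errors = [24032, 32112, 7424]

# Differences between consecutive runs, modulo 32768 (15-bit wraparound).
diffs = [(b - a) % 32768 for a, b in zip(first_errors, first_errors[1:])]
print(diffs)  # [8080, 8080]
```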

Cheers,

--

Darius

patpat
Contributor

Hi Darius,

1) You are right; the initial error is on a block < (0xFFFF / 2), and I previously mistakenly wrote 0xFFFF = 65535.

2) I have performed that test also; the error period T = f(windowsize) behaves as follows:

Period      windowsize
32767            1
32766            2
32764            4
32760            8
32752           16
32736           32
32704           64

My tests show that:
T = 32768 - windowsize
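
The measured table can be checked against the formula in a couple of lines of Python:

```python
# Measured windowsize -> period pairs from the table above.
measured = {1: 32767, 2: 32766, 4: 32764, 8: 32760,
            16: 32752, 32: 32736, 64: 32704}

# Every measurement matches T = 32768 - windowsize exactly.
assert all(period == 32768 - ws for ws, period in measured.items())
print("T = 32768 - windowsize holds for every measured windowsize")
```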

3) The tests included in this report were conducted with Serva and Workstation running on the same PC, so the TFTP traffic generated when PXE-installing a guest OS never touches an Ethernet wire.

I have also conducted tests where Serva was running on PC1 and Workstation on PC2; there the TFTP traffic touched a real Ethernet wire. In this case the error pattern is not present and the transfer is consistently performed error-free.

The tests were all run on different PCs in order to rule out any hardware dependency; they were mainly performed on different models of Toshiba Tecra and HP EliteBook.

4)

a few more runs

run#4

[10/27 01:51:43.003] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/27 01:51:44.367] TFTP Err: timeout waiting for ack blk 16#384         #384

[10/27 01:51:50.264] TFTP Err: timeout waiting for ack blk 16#33144     #33144

[10/27 01:51:56.123] TFTP Err: timeout waiting for ack blk 16#368        #65904

[10/27 01:52:02.732] TFTP Err: timeout waiting for ack blk 16#33128     #98664

[10/27 01:52:08.654] TFTP Err: timeout waiting for ack blk 16#352        #131424

[10/27 01:52:11.889] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222641103 bytes in 28s, err recovery=5

run#5

[10/27 02:14:46.299] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/27 02:14:51.217] TFTP Err: timeout waiting for ack blk 16#26240     #26240

[10/27 02:14:57.039] TFTP Err: timeout waiting for ack blk 16#58992     #58992

[10/27 02:15:02.853] TFTP Err: timeout waiting for ack blk 16#26216     #91752

[10/27 02:15:08.643] TFTP Err: timeout waiting for ack blk 16#58976     #124512

[10/27 02:15:12.809] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222629455 bytes in 26s, err recovery=4

run#6

[10/27 02:16:01.295] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/27 02:16:04.812] TFTP Err: timeout waiting for ack blk 16#16544     #16544

[10/27 02:16:10.668] TFTP Err: timeout waiting for ack blk 16#49304     #49304

[10/27 02:16:16.832] TFTP Err: timeout waiting for ack blk 16#16528     #82064

[10/27 02:16:22.688] TFTP Err: timeout waiting for ack blk 16#49288     #114824

[10/27 02:16:28.583] TFTP Err: timeout waiting for ack blk 16#16512     #147584

[10/27 02:16:31.989] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222641103 bytes in 30s, err recovery=5

I theorized about an uninitialized variable just to explain a random number that seems to have no dependency on timing (inter-frame delay).

My first thought was that the initial error was related to some transfer-speed issue, which would add the randomness to the pattern; but that is not the case.

I think at this point it's easier to check the code than to keep looking for more peculiarities in the pattern. At least that's what I'd do if I had to fix this thing 😉

5) Extra info not mentioned before

I have used different VMware NIC emulations and the pattern is always there.

Let me know if you need more info.

Patrick

dariusd
VMware Employee

Hi Patrick,

Just to double-check, in your original post you wrote that you used a windowsize of 16 and saw a period of 32760, yet your latest post says that a windowsize of 16 gave a period of 32752... can you confirm that this was an error in your original post and that your second post is correct?

I think at this point it's easier to check the code than to keep looking for more peculiarities in the pattern. At least that's what I'd do if I had to fix this thing 😉

I had already looked through the PXE ROM code and researched the way in which it interacts with Microsoft's bootloader before writing my reply. 😉

Finding patterns is a crucial part of debugging... otherwise we would not know where in the code to look and what to look for, and indeed we might overlook a clue telling us that it's not our fault at all.  It's amazing how often we end up debugging problems in guest OSes, host OSes, firewall software, application software, even hardware... 😀

Speaking of which... do you have any firewall/network/Internet security software installed on the host, even if it is disabled, and even if it is configured "open" to allow all the relevant services?

Cheers,

--

Darius

patpat
Contributor

Hi Darius,

1) I have just checked the Serva TFTP windowsize for the tests posted here; I confirm it's not 16 as I wrote in the first post but 8, so the function T = f(windowsize) posted in my second post is valid. I'd appreciate it if you could edit and correct my first post to avoid misleading anyone.

2) I think the problem is not just related to Microsoft bootloaders; the pattern is present when TFTP-transferring other, non-MS content such as Linux installs, or on some big TFTP transfers when using e.g. the pxelinux memdisk approach.  The pattern is sometimes hidden when the files are small and their transfers simply end before the first error happens.

About looking for more patterns you might be right; IMHO there are many leads pointing to the TFTP ACK handling. I'd personally try to delve into the code, but I understand your point about trying to avoid ending up debugging somebody else's bug 😉

3) The host OS (Windows 8 Pro) has only the native firewall, without any other 3rd-party security product.
I really cannot imagine a firewall counting ACKs and periodically blocking some of them, though.

Best,

Patrick

dariusd
VMware Employee

I have thought of two more tests that would be interesting to perform...

1) What happens if you simultaneously PXE boot two VMs in the failing configuration?  Do you have the resources to create a second VM on that host to try it out for me?

It'd make a massive difference to the debugging process to observe whether each VM continues to fail with the same period as when run standalone, or whether the behavior changes.  It might be that each VM starts to fail twice as often, or that the failures become irregular but with a higher overall frequency, or something else altogether...

2) Does the outcome change if you run with 1 virtual processor core (1 processor and 1 core per processor) versus two or more?  I doubt the number of cores would make any difference beyond 2... if there is any effect, it'll probably be "one" versus "more than one".

Thanks,

--

Darius

dariusd
VMware Employee

Hi again Patrick,

Thanks for the info.

You can edit your earlier post if you want to... When you are logged in to Communities and viewing this thread, there should be an "Edit" button underneath each of your posts.  When I make edits to my posts, I tend to insert a comment to make the thread sensible for later readers (and so that it doesn't look like I asked a stupid question in my reply 😜)... something like "Serva TFTP windowsize=8 for the tests above (EDIT: was mistakenly 16 in original post)".

I don't think it's a problem with TFTP ACKs.  My reasoning: At the TFTP level, there should be no variation in the content of the TFTP ACKs for any particular block from one run to the next.  The ACK includes the block number that's being acked and a constant opcode... that's all.  If it was a failure at the TFTP level, I'd expect to see the failure at the same block number in each run.  There is no scope for uninitialized data at the TFTP level to influence the failure.

Now... at the UDP level, there is a 16-bit checksum applied to the packet.  The UDP source port and destination port are included in that checksum, and those port numbers can vary across runs.  So, for each run, there will be a different point during the transfer where the UDP checksum will be precisely zero, or precisely 65535, and so forth.  The value "0" in a UDP checksum is treated specially, meaning that the sender did not include a checksum, and "If the checksum calculation results in the value zero (all 16 bits 0) it should be sent as the one's complement (all 1s)" (cite: wikipedia). That's the sort of thing that could possibly be mishandled by either the sender or receiver of a UDP packet.  That would provide a neat explanation for a failure with a periodicity of 65536 blocks which varies across runs, but it doesn't immediately explain a failure with a periodicity of 32768 blocks, and certainly can't explain a failure with periodicity of 32768-windowSize blocks.
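
The special-case rule for UDP that Darius quotes can be sketched like this (illustrative Python following RFC 768, not any particular stack's code):

```python
def ones_complement_sum(words):
    """One's-complement sum of 16-bit words, with end-around carry."""
    s = 0
    for w in words:
        s += w
        s = (s & 0xFFFF) + (s >> 16)
    return s

def udp_checksum(words):
    """RFC 768: a computed checksum of 0x0000 is transmitted as 0xFFFF,
    because a zero on the wire means 'no checksum supplied'."""
    c = ~ones_complement_sum(words) & 0xFFFF
    return 0xFFFF if c == 0 else c

# Pseudo-header + payload words summing to 0xFFFF ("negative zero")
# complement to 0x0000, so the sender must emit 0xFFFF instead.
print(hex(udp_checksum([0x8000, 0x7FFF])))  # 0xffff
```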

Any chance you could post a fragment of the wireshark capture spanning about 20 packets either side of a missing ACK?

Cheers,

--

Darius

patpat
Contributor

1) This is the log for simultaneously PXE-installing the same guest OS on two VMs (only the ServaBoot.wim transfers):

[10/27 17:17:22.443] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/27 17:17:25.629] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/27 17:17:26.598] TFTP Err: timeout waiting for ack blk 16#20856     #20856

[10/27 17:17:28.284] TFTP Err: timeout waiting for ack blk 16#10064     #10064

[10/27 17:17:30.092] TFTP Err: timeout waiting for ack blk 16#37944     #37944

[10/27 17:17:36.486] TFTP Err: timeout waiting for ack blk 16#59008     #59008

[10/27 17:17:40.764] TFTP Err: timeout waiting for ack blk 16#38512     #104048

[10/27 17:17:42.348] TFTP Err: timeout waiting for ack blk 16#25840     #91376

[10/27 17:17:45.237] TFTP Err: timeout waiting for ack blk 16#38776     #104312

[10/27 17:17:47.067] TFTP Err: timeout waiting for ack blk 16#9336     #140408

[10/27 17:17:48.683] TFTP Err: timeout waiting for ack blk 16#54824     #120360

[10/27 17:17:48.898] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222629455 bytes in 26s, err recovery=4

[10/27 17:17:53.258] TFTP Err: timeout waiting for ack blk 16#13744     #144816

[10/27 17:17:54.440] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222652751 bytes in 29s, err recovery=6

It seems the pattern changes; it looks like there's no longer a fixed period?
Please consider that Serva's errors for the two transfers are interleaved in the log above.

2)

All the previous tests were performed with 1 processor, 1 core per processor;
the following test used 2 processors, 2 cores per processor:

[10/27 18:25:18.604] TFTP Inf: Read file <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>. Mode octet

[10/27 18:25:22.267] TFTP Err: timeout waiting for ack blk 16#17336     #17336

[10/27 18:25:28.153] TFTP Err: timeout waiting for ack blk 16#50096     #50096

[10/27 18:25:34.091] TFTP Err: timeout waiting for ack blk 16#17320     #82856

[10/27 18:25:41.220] TFTP Err: timeout waiting for ack blk 16#50080     #115616

[10/27 18:25:47.200] TFTP Err: timeout waiting for ack blk 16#17304     #148376

[10/27 18:25:47.869] TFTP Inf: <\WIA_WDS\s2012_R2\_SERVA_\boot\ServaBoot.wim>: sent blks=152873 blkSz=1456, Total 222641103 bytes in 29s, err recovery=5

3) I edited the first post, striking through the wrong values.

4) I got the pcap capture (find it attached to the first post).

Open the file and decode it as TFTP traffic; it corresponds to the following error:

[10/27 17:48:15.758] TFTP Err: timeout waiting for ack blk 16#62632     #62632


Best,

Patrick

Message was edited by: patpat

dariusd
VMware Employee

Interesting... the capture shows that the missing ACK would have been issued for a packet for which the IP checksum was about to roll over from 0xffff to 0x0001 (packets 11 and 12 of the capture).  The IP checksum is changing because the sender is populating the IPv4 Identification field with a monotonically increasing count.

Is the same IP checksum rollover also happening at other points in the capture (and other captures) where an ACK went missing?  I'm particularly interested in the next missing ACK, which should be at around block number 95385, or the prior one, at 29865.

Thanks,

--

Darius

patpat
Contributor

There might be something here...

I just ran a new capture; all the windows that presented the missing-ACK issue
always contained a packet with IP checksum = 0xFFFF.

Probably the receiver is quietly dropping all datagrams containing an IP header with a checksum of 0xFFFF?

If that is the case, then the problem might not be related to the PXE ROM but to the IP stack's UDP handling instead?

But if the whole VMware transport discarded UDP/IP packets with IP checksum = 0xFFFF, this error would have been found before...

so are we back to the PXE ROM's UDP/IP stack?

Best,

Patrick

dariusd
VMware Employee

Hey Patrick!

I've dug deeper into this... It turns out that an IP checksum of 0xFFFF is somewhat unusual and should not really happen, even though wireshark reports it as valid.  See RFC 1624 under "3. Discussion".  Our PXE ROM is rejecting those packets because of their invalid checksum.  BUT... RFC 1624 also explains how RFC 1141 can bring about the transmission/forwarding of packets with an IP checksum of 0xFFFF, and if we'd been validating the checksum in the recommended way (per RFC 1071), we would be accepting the packets anyway.
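
The distinction Darius draws can be made concrete. Under the RFC 1071 method you sum the entire header, checksum field included, and accept if the result is all ones; a stricter recompute-and-compare check rejects a 0xFFFF checksum that the sum test would accept. A sketch in Python (toy 16-bit words, not a real IP header):

```python
def ones_complement_sum(words):
    """One's-complement sum of 16-bit words, with end-around carry."""
    s = 0
    for w in words:
        s += w
        s = (s & 0xFFFF) + (s >> 16)
    return s

def ok_rfc1071(header_words):
    """RFC 1071 verification: the sum over the whole header (checksum
    included) must be all ones. Tolerates a 0xFFFF checksum field."""
    return ones_complement_sum(header_words) == 0xFFFF

def ok_strict(rest_words, cksum):
    """Over-strict verification: recompute and compare the field."""
    return cksum == (~ones_complement_sum(rest_words) & 0xFFFF)

rest = [0x8000, 0x7FFF]   # toy header fields summing to "negative zero"

print(ok_rfc1071(rest + [0xFFFF]))  # True: RFC 1071 accepts the packet
print(ok_strict(rest, 0xFFFF))      # False: the strict check drops it
```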

So... the issue is a combination of something weird at your site issuing UDP packets with an IP checksum of 0xFFFF, and us being unnecessarily strict and rejecting those packets outright.

I've filed a bug report to have our checksum validation "loosened" to accept 0xFFFF (realistically: to have it implemented the way it should have been all along...).

Thanks,

--

Darius

dariusd
VMware Employee

Hi again Patrick,

I also noticed that the release notes for Serva 2.1.3 indicate that it corrects an issue related to "TFTP RFC-1350 error recovery under certain conditions"... which could conceivably be the issuance of IP packets with 0xFFFF header checksum... If you're not already using Serva 2.1.3, you may want to consider updating.  (The release-noted issue might be something else altogether, but it's worth trying...)

Cheers,

--

Darius

patpat
Contributor

1) I think an IP checksum of 0xFFFF is perfectly valid today, per RFC 1141.

2) I really cannot see anything weird at my site; the IP header checksum is supposed to be calculated and injected by the Winsock layer when the function sendto() sends a TFTP datagram to the client. So it is Microsoft who puts in the IP checksum = 0xFFFF.

3) I honestly think you guys are not loosening things by accepting checksum 0xFFFF; you're just following the current standard.

4) I'm using Serva's latest version, and as I said before, it is not Serva that calculates/injects the IP header checksum; it is Microsoft.
Serva uses plain RFC 1350 TFTP only when windowsize is not negotiated (windowsize was negotiated in all these tests); moreover, that bug fix improves Serva's TFTP recovery capabilities, and those capabilities are what has so far allowed circumventing this bug.

Anyway, I think this VMware PXE ROM bug fix will make TFTP transfers faster and more reliable.

Thanks for a great product and great support.

Best,

Patrick

PS I've sent you a PM

dariusd
VMware Employee

Hi Patrick!

Yeah, it looks like RFC 1141 accidentally made an IP checksum of 0xFFFF possible, and RFC 1624 tries to clean up the resulting confusion, while highlighting that the recommended checksum verification method in RFC 1071 would have avoided any problems in the first place. 😢
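
RFC 1624 summarizes RFC 1141's incremental update as HC' = HC + m + ~m' (one's-complement arithmetic, where a 16-bit field changes from m to m'). A toy example in Python shows how walking the IPv4 Identification field upward with that formula can land on a checksum of 0xFFFF at a point where a full RFC 1071 recompute would give 0x0000 (the field values below are made up for illustration):

```python
def fold(s):
    """Fold a sum down to 16 bits with end-around carry."""
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s

def full_checksum(words):
    """Full recompute: one's complement of the one's-complement sum."""
    return ~fold(sum(words)) & 0xFFFF

def rfc1141_update(hc, m, m_new):
    """Incremental update HC' = HC + m + ~m' (RFC 1141, via RFC 1624)."""
    return fold(hc + m + (~m_new & 0xFFFF))

rest = [0x8000]     # toy header minus the Identification field
ident = 0x7FFD
hc = full_checksum(rest + [ident])   # start from a freshly computed checksum

# The sender bumps the Identification field packet by packet,
# updating the checksum incrementally each time.
while ident < 0x7FFF:
    hc = rfc1141_update(hc, ident, ident + 1)
    ident += 1

print(hex(hc))                              # 0xffff via incremental update
print(hex(full_checksum(rest + [ident])))   # 0x0 via full recompute
```

Both values still verify under RFC 1071's sum-to-all-ones check, which is why loosening the PXE ROM's validation resolves the interop problem.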

Hopefully we will soon be using the RFC 1071 recommended checksum verification method and the problem will be solved -- as I said above, this is how it should have been done in the first place.  Sorry for the trouble, and many thanks for your help figuring this out.

Cheers,

--

Darius

patpat
Contributor

This problem looks to be fixed, at least as of Workstation 11.

Thanks VMware
Patrick
