VMware Communities
admin
Immortal

Bug Hunt: Lost DNS

Description

A number of people have reported that networking works at first but stops after a period of time, perhaps hours or days. Specifically, DNS lookups (e.g. www.vmware.com or www.apple.com) stop working, but accessing a site by IP address (e.g. 72.247.74.52 or 17.112.152.32) still works. Access by IP address working is the key distinction between this and other bugs; if access by IP address doesn't work either, you may be seeing something unrelated. Another critical feature of this bug is that host networking still works. Reports so far indicate that the problem is limited to NAT mode networking. As a workaround, restart Fusion's networking by running sudo "/Library/Application Support/VMware Fusion/boot.sh" --restart (the quotes are needed because the path contains spaces).
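A quick way to check from a Linux or Mac OS X guest whether you are seeing this particular symptom is to try one lookup by name and one connection by IP address. The following is only a rough sketch: the function names are made up for illustration, it assumes the host and ping commands are available in the guest, and the default name and address are simply the examples given above (the address may no longer be current, and some sites do not answer ping, so visiting the address in a browser is just as good a test).

```shell
#!/bin/sh
# Sketch: does the guest show "DNS fails, IP works"?
# Function names are hypothetical; the name and address are the examples from the post.

# Print a diagnosis from two yes/no answers:
#   $1 = did the DNS lookup work?   $2 = did access by IP address work?
classify_symptom() {
    if [ "$1" = "no" ] && [ "$2" = "yes" ]; then
        echo "matches: DNS fails but access by IP address works"
    elif [ "$1" = "yes" ] && [ "$2" = "yes" ]; then
        echo "no problem seen right now"
    else
        echo "probably a different problem: access by IP address fails"
    fi
}

# Run the two probes in the guest and classify the result.
check_lost_dns() {
    name=${1:-www.vmware.com}
    addr=${2:-72.247.74.52}
    # host exits non-zero when the lookup fails or times out
    if host -W 5 "$name" >/dev/null 2>&1; then dns_ok=yes; else dns_ok=no; fi
    if ping -c 3 "$addr" >/dev/null 2>&1; then ip_ok=yes; else ip_ok=no; fi
    classify_symptom "$dns_ok" "$ip_ok"
}
```

Calling check_lost_dns with no arguments uses the defaults; pass a name and an address you know to be good if the defaults are stale.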

We want to fix this, but have never been able to reproduce the bug in-house; we need your help to track it down. Somewhere along the way the DNS lookup (or its reply) is getting dropped, and we need to know where. To find out, we need packet traces from multiple layers taken at the same time: if one layer shows the packets of interest but the next layer doesn't, we've gotten closer to identifying the problem. The layers are inside the guest, the virtual network layer, and the host; all three traces should be taken at the same time, while networking isn't working. A bonus would be a set of simultaneous traces from when the system is good (so if you see this problem frequently, you could take one set when you start up and another if/when things go bad). When you take a set of traces, try to minimize other network activity, as that makes the traces easier for us to read. You will need administrator privileges to take traces. Remember to generate some traffic from the guest while taking the traces, e.g. by visiting a website.
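If you want a first look at your own traces before sending them, the layer-by-layer idea can be approximated by counting DNS packets in each file. This is a coarse sketch, not how the traces are analyzed in-house: the function names are invented, the file names guest.pcap, vmnet.pcap and host.pcap are assumptions, and a non-zero count only tells you that some traffic on port 53 was seen at that layer, not that the replies made it back.

```shell
#!/bin/sh
# Sketch: count DNS packets in each trace to see at which layer they vanish.
# Function names and file names are assumptions made for illustration.

# Number of packets to or from port 53 in one capture file.
count_dns() {
    tcpdump -n -r "$1" port 53 2>/dev/null | wc -l | tr -d ' '
}

# Given the three counts (guest, vmnet, host), name the first layer with none.
first_silent_layer() {
    if [ "$1" -eq 0 ]; then echo guest
    elif [ "$2" -eq 0 ]; then echo vmnet
    elif [ "$3" -eq 0 ]; then echo host
    else echo none
    fi
}

# Example:
#   first_silent_layer "$(count_dns guest.pcap)" "$(count_dns vmnet.pcap)" "$(count_dns host.pcap)"
```

A result of "none" means every layer saw some DNS traffic, in which case the traces need to be read packet by packet (for example in Wireshark) to match queries against replies.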

Taking a trace from inside the guest

Directions depend on the guest OS. Use your favorite packet capture tool, such as Wireshark.

I'll update this part with more details when I get time.

Taking a trace from the virtual network layer

Use vmnet-sniffer, located in /Library/Application Support/VMware Fusion/vmnet-sniffer (included with Fusion). Example syntax is

sudo /Library/Application\ Support/VMware\ Fusion/vmnet-sniffer -e -w ~/Desktop/vmnet.pcap vmnet8

This will log to vmnet.pcap on your desktop, overwriting any previous contents. When you're done taking the trace, use control-c to break.

Taking a trace from the host

Use your favorite packet capture tool, such as Wireshark or tcpdump (included in OS X). Make sure you're capturing on the interface that the virtual machine is using! For example, if you're using a wired connection, you might do

sudo /usr/sbin/tcpdump -i en0 -w ~/Desktop/host.pcap

This will log to host.pcap on your desktop, overwriting any previous contents. When you're done taking the trace, use control-c to break. Unlike vmnet-sniffer, you won't see packets as they come in.
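Because both commands overwrite their output file, a "good" set and a "bad" set of traces need different file names. One way to keep them apart is sketched here: a small helper that prints the two host-side commands with a label in the file name, ready to paste into two Terminal windows. The function name and the good/bad labels are illustrative only; the commands themselves are the ones shown above.

```shell
#!/bin/sh
# Sketch: print the capture commands with labelled file names, so a "good" set
# and a "bad" set don't overwrite each other. The function name and the
# good/bad labels are illustrative only.

print_capture_commands() {
    label=${1:-bad}    # "good" or "bad"
    iface=${2:-en0}    # host interface the virtual machine's traffic uses
    printf 'sudo /Library/Application\\ Support/VMware\\ Fusion/vmnet-sniffer -e -w ~/Desktop/vmnet-%s.pcap vmnet8\n' "$label"
    printf 'sudo /usr/sbin/tcpdump -i %s -w ~/Desktop/host-%s.pcap\n' "$iface" "$label"
}

# Example: print_capture_commands good en1
```

Start both commands before generating traffic in the guest, and stop each with control-c when you're done.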

Other information to include

  • Host OS, e.g. 10.5.5, 10.4.11, something else?

  • How you're accessing the network. Wired/wireless (what model access point)? Through a router (what brand), directly to a modem, something else?

  • What guest OS, including bitness, flavor, and patch level? e.g. "Windows XP Professional SP3 32-bit", or "Ubuntu 8.04 64-bit with all updates as of today"

  • What type of network adapter does the guest see? e.g. e1000, vmxnet, AMD PCnet-PCI II, something else?

  • How often does this occur?

Once you've got all this, zip it all up and get it to us somehow, such as by posting in this thread. If you're worried about the traces containing sensitive information, you can email them to fusion-feedback at the obvious domain. Thanks!
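For anyone who wants a starting point, the checklist above could be turned into a text file and zipped together with the traces roughly like this. It is a sketch under assumptions: the function name and all file names are made up, it is meant to be run on the Mac host (sw_vers ships with OS X), and every "fill in" line still has to be edited by hand.

```shell
#!/bin/sh
# Sketch: collect the requested details in a text file to zip with the traces.
# File names and the function name are illustrative; run this on the Mac host.

write_report() {
    out=$1
    {
        # sw_vers ships with OS X; fall back to a placeholder elsewhere
        echo "Host OS: $(sw_vers -productVersion 2>/dev/null || echo 'fill in')"
        echo "Network access (wired/wireless, access point, router): fill in"
        echo "Guest OS (flavor, patch level, 32/64-bit): fill in"
        echo "Guest network adapter (e1000, vmxnet, ...): fill in"
        echo "How often it occurs: fill in"
    } > "$out"
}

# Example use, assuming the traces are on the Desktop under these names:
#   cd ~/Desktop
#   write_report lostdns-info.txt      # then edit the "fill in" lines
#   zip lostdns.zip lostdns-info.txt guest.pcap vmnet.pcap host.pcap
```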


Accepted Solutions
mykmelez
Enthusiast

I just sent the following message (along with the capture files) to fusion-feedback at the obvious domain:

Hello,

This message is in response to the Bug Hunt: Lost DNS thread <http://communities.vmware.com/thread/177416> on the community forums.

After experiencing the problem, I followed the instructions in the forum thread to capture packets from the guest, virtual network layer, and host while loading a Google URL <http://www.google.com/search?q=foobarbaz&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a>, which experienced a DNS query timeout.

I captured packets with the following commands:

sudo /usr/sbin/tcpdump -i en0 -w ~/Desktop/host.pcap

sudo /Library/Application\ Support/VMware\ Fusion/vmnet-sniffer -e -w ~/Desktop/vmnet.pcap vmnet8

sudo /usr/sbin/tcpdump -i eth4 -w /mnt/hgfs/myk/Desktop/guest.pcap

I have archived the three capture files in a ZIP file and attached it to this message.

My system configuration is Mac OS X 10.5.6 with all the latest updates as of today, a wired connection, an Ubuntu 8.04 32-bit guest OS with the latest updates as of yesterday, and the following network adapter (according to lshw):

*-network
     description: Ethernet interface
     product: 79c970
     vendor: Advanced Micro Devices
     physical id: 0
     bus info: pci@0000:02:00.0
     logical name: eth4
     version: 10
     serial: 00:0c:29:77:15:9e
     size: 1GB/s
     capacity: 1GB/s
     width: 32 bits
     clock: 33MHz
     capabilities: bus_master ethernet physical logical tp 1000bt-fd
     configuration: autonegotiation=off broadcast=yes driver=vmxnet driverversion=2.0.1.1 duplex=full firmware=N/A ip=172.16.75.128 latency=64 link=yes maxlatency=255 mingnt=6 module=vmxnet multicast=yes port=twisted pair speed=1GB/s

I see the problem every one to three days. When it occurs, DNS queries often time out, but it's not just DNS; network traffic in general is very slow, and many requests time out. For example, I might load a web page, and the DNS query on the hostname will succeed, but the web page's HTML will take forever to load, and some of the images embedded in it will fail to load (i.e. their loads will time out after a while, and the browser will show a "broken image" icon in their place).

At the time I captured these packets, only Finder, VMware Fusion, and Terminal were running in Mac OS X. Only Firefox and GNOME Terminal were running in Ubuntu. Well, plus the packet capturers and the usual daemons.

Let me know if I can provide any additional information.

Regards,

Myk Melez

Replies
mindedc
Contributor

FYI, I have experienced this exact issue and have noticed that switching to bridged mode does indeed cure it. I only noticed this when I started doing some work that required me to move several hundred gigs of data from my Windows partition. I moved from working in my living room over 802.11n to working at my desk, plugged into my gigabit network. I have NEVER seen this issue on wireless, only on Ethernet and only on Fusion 2.0. When I experienced it, DNS and WINS hung, but IP connectivity worked fine. I did a Wireshark capture from the host and saw that it was sending out requests but not getting replies back. I changed over to bridged mode and that cured it. The place where I noticed the issue was with local file copies to my Linux server over CIFS/SMB/Samba. I have Wireshark in the guest and on the host, and I have tcpdump on the DNS server (the same machine as the Samba server), so I will specifically go back into NAT mode and make captures when I re-create the problem.

Another realization I made is that CIFS performance sucks through the VMware NAT layer in the guest OS. I had performance problems writing out as little as 100 megs over the network. In bridged mode it takes a few seconds; in NAT mode it can take 30-40 seconds. The cache on the RAID controller in the server has more than enough RAM to cache 100 megs' worth of writes, not to mention the cache in the server itself, so a transfer that small should basically be limited by the wire and the CIFS protocol. I have noticed that if I shut down VMware and restart it, the performance gets significantly better. I have abandoned use of bridge mode while at home for that reason and only use it while at a customer's site. It would be nice to have a quick way to toggle the networking mode for when you drop your wireless link. Either that, or have a second virtual NIC attached to the wireless link via NAT, so that when your bridged Ethernet link goes away the laptop can switch over to wireless.

phegaro
Contributor

I am able to see this consistently every day. I can provide all the logs, but I don't want to go through the trouble if you have already solved this issue. Can you please respond if you want me to provide more information?

admin
Immortal

Please do provide the logs! This is still an open question -- I'll mark it as answered once it's solved.

mindedc
Contributor

I used to have this issue pretty consistently, but it has not presented itself after I came across this thread. If I can get it to occur I can provide traces...

kate_ward
Contributor

Included are some dumps from me.

host:

- MacBook Pro 3.1; OS X 10.5.6; VMware Fusion 2.0.2

- en1 (wifi): 00:1b:63:c9:e8:0d 172.18.0.128/24 (dhcp) 2001:470:1f09:71d:21b:63ff:fec9:e80d/64 (dhcp) fe80::21b:63ff:fec9:e80d/64

- vmnet8: 192.168.44.1/24, 192.168.44.2/24

- uname -a: Darwin kward-mac.ie.corp.forestent.com 9.6.0 Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386 i386

- /etc/resolv.conf:

domain corp.forestent.com
nameserver 172.18.0.8
nameserver 208.67.222.222

vm: Ubuntu 8.04 Server 32-bit

- eth0: 00:0c:29:3e:04:25 192.168.44.3/24 (static) fe80::20c:29ff:fe3e:425/64

- uname -a: Linux ubuntu804s 2.6.24-23-server #1 SMP Mon Jan 26 00:55:21 UTC 2009 i686 GNU/Linux

- /etc/resolv.conf:

search localdomain
nameserver 192.168.44.2

pcap's grabbed with:

- host: sudo tcpdump -i en1 -w host.pcap

- vm: sudo tcpdump -i eth0 -w vm.pcap

- vmnet: sudo /Library/Application\ Support/VMware\ Fusion/vmnet-sniffer -e -w /tmp/vmnet.pcap vmnet8

As with the previous commenter, this happens roughly every 1-3 days. Once it starts, all networking is pretty slow to the VM. I have experienced this with Fusion 2.0.0-2.0.2, and IIRC 1.x as well (but that was a while ago).

A bit more that I have tried (each type of request at least 3x, host/vm is where the command was run):

- dig slashdot.org @192.168.44.1 -- host: no vm: no

- dig slashdot.org @192.168.44.2 -- host: no vm: no

- dig +tcp slashdot.org @192.168.44.2 -- host: host unreachable vm: connection refused

- dig slashdot.org @172.18.0.8 (my internal dns server) -- host: yes vm: works once, then subsequently fails

- dig +tcp slashdot.org @172.18.0.8 -- host: yes vm: yes (eventually and slow, but subsequent are fast and repeatable)

- dig slashdot.org @172.18.0.8 (immediately after +tcp) -- vm: works once, then subsequently fails

I was surprised to see that only UDP requests were allowed against the VM interface (192.168.44.2), since DNS requires TCP once replies become too large. DNS can use either TCP or UDP for transport, although UDP is generally preferred.

From RFC 2181:

9. The TC (truncated) header bit

The TC bit should be set in responses only when an RRSet is required as a part of the response, but could not be included in its entirety. The TC bit should not be set merely because some extra information could have been included, but there was insufficient room. This includes the results of additional section processing. In such cases the entire RRSet that will not fit in the response should be omitted, and the reply sent as is, with the TC bit clear. If the recipient of the reply needs the omitted data, it can construct a query for that data and send that separately.

Where TC is set, the partial RRSet that would not completely fit may be left in the response. When a DNS client receives a reply with TC set, it should ignore that response, and query again, using a mechanism, such as a TCP connection, that will permit larger replies.
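For anyone who wants to repeat the UDP versus TCP comparison above, the dig runs can be wrapped in a small loop. This is a sketch only: the function names are invented, the server addresses in the example are the ones from this post, and "yes" merely means dig got some reply back from the server (dig exits non-zero when no server could be reached).

```shell
#!/bin/sh
# Sketch: repeat one lookup over UDP and over TCP against several servers.
# Function names are illustrative; the example addresses are from this post.

# Run a command quietly and print "yes" if it succeeded, "no" otherwise.
yes_no() {
    if "$@" >/dev/null 2>&1; then echo yes; else echo no; fi
}

probe_servers() {
    name=$1; shift
    for server in "$@"; do
        # +time and +tries keep a dead server from stalling the loop
        echo "$server udp: $(yes_no dig +time=2 +tries=1 "$name" "@$server")"
        echo "$server tcp: $(yes_no dig +tcp +time=2 +tries=1 "$name" "@$server")"
    done
}

# Example: probe_servers slashdot.org 192.168.44.1 192.168.44.2 172.18.0.8
```

Running it several times in a row, on both the host and the guest, would show the "works once, then fails" pattern if it is present.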

Looking forward to a fix! Let me know if I can help with anything else.

- kate

uair01
Contributor

Just for your information.

I'm running a small stable of guest systems in VMware Desktop 6.5.0 build 118116 and I experience the same problem when using NAT.

All was well for several weeks, and now no guest with a NAT configuration will work anymore.

The host is Vista Home Premium, and the guests that experience the problem are:

FreeBSD, SUSE 11, Windows XP, Windows 2003, Windows 7, Windows 2008 and Windows Vista.

I'll try to collect some traces.

notivan
Contributor

I just ran into this bug for the first time since the beta tests of Fusion 1.0. Here's the trace you wanted. Hope it helps.

noetus
Contributor

I am not sure if I am getting this problem or a different one. Can you tell me so I can give you the correct dumps?

I can browse the internet using site names as well as IPs, and in general there are no browsing issues from the guest.

However, I have always had intermittent connection problems from host to guest and vice versa. They affect host-guest access and file sharing, though not VMware shared folders or internet sharing. When this error occurs, I can see the host/guest name on the respective guest/host network connections, but cannot connect to them. A connection error is always generated, both from the host side and from the guest side (Windows 2003 Server).

Today I had this problem, and discovered I can connect host-guest using IP address, just not using the VM name from the host, or the Mac name from the guest.

It appears to be this bug, but affecting access only within the VM network, not to the internet.

admin
Immortal

I don't really understand the intricacies of the bug, but reading the internal comments, I think it would affect all connections in the guest, not just those to the host. Plus, it should be fixed in Fusion 2.0.5, so if you're using that and still seeing this behavior, it's probably different.
