VMware Communities
cernvm
Contributor

Caching DNS behind Fusion's NAT

I want to include a caching name server (BIND 9.3.4) in an appliance. /etc/resolv.conf points to localhost, and the caching BIND server forwards every request to the DHCP-assigned DNS server, which in the case of Fusion's NAT is 172.16.8.2. Fusion's name server seems to behave a little oddly. First, it rewrites every TTL to 5 seconds. What is the reason for doing so?

But where it really breaks is with IPv6 queries. Programs like ntpd try to get an AAAA record first and, if that fails, fall back to an A record. When asked for a non-existent AAAA record, the response from 172.16.8.2 somehow poisons the BIND cache: for the next 5 seconds, a request for the A record also returns just a CNAME, so the program cannot resolve the host. I tried forwarding to the upstream DNS server instead, which works fine. VirtualBox's NAT DNS server also works. Any ideas?

6 Replies
EugeneKim
Contributor

I just noticed the same thing on my VMware Fusion 3, with exactly the same symptoms.

To begin demonstrating the problem, the DNS label "purple.the-7.net" holds the following resource records (RRs):

$ dig @ns1.the-7.net purple.the-7.net IN ANY +norecurse +vc +noall +answer

; <<>> DiG 9.6.1-P1 <<>> @ns1.the-7.net purple.the-7.net IN ANY +norecurse +vc +noall +answer
; (2 servers found)
;; global options: +cmd
purple.the-7.net.	300	IN	A	64.71.156.44
purple.the-7.net.	300	IN	KEY	512 3 3 CL6UZhTjW3mcP7QP5dtOVD1AO0OHjHLhbVIU0JJXoxCt85nFNyx01r6q eGswFz05tWc/Mpuk+E3sybnt1shzJWLWLaiSTUoJC6+RszLNQfHQep2P GLiQqTbZUPZZ45trDuppON79Sl71WZZyy2u0FLSGrrV5tb6AvRgX32wE EOoRW2O9QR0LG0oQXbJZL3/WpTpd33kSs+8nyV+bW7BfjtsQqydcfNvV tOEdoPUBtu/q5bCqefmvyoowuTlQG9NHW73E8j0OQkEgeg1xlS++91Bg vkkTyONfUePIL81Q2+qEHZPOyg67KWtK+66z6qW3EUrQ+K13R/7ZZtMt s4uMw8eb+8UvsKOKF4YS9vRvgQu71BkXU4uAudJTSEgVjOQyZaj4XbYv vBfwwSU7u2RWrsKPB3kkohq7mZWPcbWF4cdOCnrecJQEG+Q9POFsdG/U x7eoOtMoQs6UX4kFTTrZlL7MQV4Gw738Caoq6cWIM6xAuEReFJjgJqZt /7SNXV/P6SsRVAPDS6OPr4UgdDhxv9EUYOiL
purple.the-7.net.	0	IN	AAAA	2001:470:1f01:622::c
purple.the-7.net.	300	IN	A6	64 ::c colo1-net.the-7.net.
$ 

As seen above, the label holds no record of type SRV, for instance.

Now, when a correctly operating nameserver (10.0.0.1 in this case) is queried for a label with a nonexistent type, the response should include no "answer" records:

$ dig @10.0.0.1 purple.the-7.net IN SRV 

; <<>> DiG 9.6.1-P1 <<>> @10.0.0.1 purple.the-7.net IN SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44166
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;purple.the-7.net.		IN	SRV

;; AUTHORITY SECTION:
the-7.net.		60	IN	SOA	ns1.the-7.net. hostmaster.the-7.net. 2006090954 10800 3600 604800 60

;; Query time: 281 msec
;; SERVER: 10.0.0.1#53(10.0.0.1)
;; WHEN: Fri Jan  8 18:11:53 2010
;; MSG SIZE  rcvd: 85

$ 

However, VMware Fusion's DNS proxy (192.168.240.2) behaves differently:

$ dig @192.168.240.2 purple.the-7.net IN SRV

; <<>> DiG 9.6.1-P1 <<>> @192.168.240.2 purple.the-7.net IN SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45144
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;purple.the-7.net.		IN	SRV

;; ANSWER SECTION:
purple.the-7.net.	5	IN	CNAME	purple.the-7.net.
purple.the-7.net.	5	IN	A	64.71.156.44

;; Query time: 60 msec
;; SERVER: 192.168.240.2#53(192.168.240.2)
;; WHEN: Fri Jan  8 18:13:14 2010
;; MSG SIZE  rcvd: 80

$ 
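
The difference between the two responses above is already visible in the fixed 12-byte DNS header: the reply from 10.0.0.1 is NOERROR with an empty answer section (a "NODATA" response), while the proxy's reply carries two answer records. A minimal sketch of that check, assuming the raw message bytes are at hand; the names dns_header, parse_dns_header and is_nodata are made up for illustration and are not part of any resolver library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The six 16-bit fields of the fixed DNS header (RFC 1035, section 4.1.1). */
struct dns_header {
    uint16_t id;
    uint16_t flags;
    uint16_t qdcount;
    uint16_t ancount;
    uint16_t nscount;
    uint16_t arcount;
};

/* Read a big-endian (network order) 16-bit value. */
static uint16_t
get16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode the header; returns 0 on success, -1 if the message is too short. */
int
parse_dns_header(const unsigned char *msg, size_t len, struct dns_header *h)
{
    assert(msg != NULL && h != NULL);
    if (len < 12)
        return -1;
    h->id = get16(msg);
    h->flags = get16(msg + 2);
    h->qdcount = get16(msg + 4);
    h->ancount = get16(msg + 6);
    h->nscount = get16(msg + 8);
    h->arcount = get16(msg + 10);
    return 0;
}

/* NODATA: a response (QR bit set) with RCODE 0 (NOERROR) and no answers. */
int
is_nodata(const struct dns_header *h)
{
    int is_response = (h->flags & 0x8000) != 0;
    int rcode = h->flags & 0x000F;

    return is_response && rcode == 0 && h->ancount == 0;
}
```

Given the header of the reply from 10.0.0.1 (flags qr rd ra, ANSWER: 0), is_nodata() returns 1; given the header of the proxy's reply (flags qr aa, ANSWER: 2), it returns 0.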

Right now this problem causes my FreeBSD guest to emit spurious warnings:

$ grep 'AAAA' /var/log/messages | tail  
Jan  8 18:07:55 blue firefox-bin: gethostby*.getanswer: asked for "www.bind9.net IN AAAA", got type "A"
Jan  8 18:07:56 blue firefox-bin: gethostby*.getanswer: asked for "www.zytrax.com IN AAAA", got type "A"
Jan  8 18:07:56 blue firefox-bin: gethostby*.getanswer: asked for "www.faqs.org IN AAAA", got type "A"
Jan  8 18:08:13 blue firefox-bin: gethostby*.getanswer: asked for "ftp.is.co.za IN AAAA", got type "A"
Jan  8 18:08:23 blue firefox-bin: gethostby*.getanswer: asked for "www.dnssec-tools.org IN AAAA", got type "A"
Jan  8 18:08:23 blue firefox-bin: gethostby*.getanswer: asked for "www.freesoft.org IN AAAA", got type "A"
Jan  8 18:08:23 blue firefox-bin: gethostby*.getanswer: asked for "dnsjava.org IN AAAA", got type "A"
Jan  8 18:08:23 blue firefox-bin: gethostby*.getanswer: asked for "dnsruby.rubyforge.org IN AAAA", got type "A"
Jan  8 18:08:23 blue firefox-bin: gethostby*.getanswer: asked for "lists.isc.org IN AAAA", got type "A"
Jan  8 18:18:53 blue firefox-bin: gethostby*.getanswer: asked for "versioncheck.addons.mozilla.org IN AAAA", got type "A"
$ 

Here, Firefox is trying to resolve those domain names into IPv6 address records (AAAA), and the resolver library called by Firefox is aggravated by the fact that the DNS server (VMware's proxy) insists on returning records of the wrong type (A).
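
To make the complaint concrete, here is a rough sketch of the kind of check a stub resolver applies to each answer record: it keeps records of the type it asked for, follows CNAMEs, and discards everything else. This is an illustration only, not FreeBSD's actual getanswer() code; the function names are hypothetical, and the numeric values are the standard DNS RR type codes.

```c
#include <assert.h>
#include <stddef.h>

/* Standard DNS RR type codes for the types that appear in this thread. */
enum rr_type { RR_A = 1, RR_CNAME = 5, RR_AAAA = 28, RR_SRV = 33 };

/*
 * An answer record is usable if it has the type that was asked for,
 * or if it is a CNAME that may lead to a record of that type.
 */
int
answer_type_ok(int qtype, int rrtype)
{
    return rrtype == qtype || rrtype == RR_CNAME;
}

/* Count the usable records in an answer section given as a list of types. */
size_t
count_usable(int qtype, const int *rrtypes, size_t n)
{
    size_t i, usable = 0;

    assert(rrtypes != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (answer_type_ok(qtype, rrtypes[i]))
            usable++;
    return usable;
}
```

With this rule, an AAAA query answered by a single A record leaves nothing usable, and the proxy's CNAME plus A reply to the SRV query leaves only the CNAME, which matches the "returns just a CNAME" symptom from the first post.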

Could we expect this problem to be fixed in a future version, hopefully soon?

Thanks,

Eugene

P.S. By the way, the DNS records shown there are live examples under my administration, not just fictitious ones; developers are encouraged to make experimental queries.

rcardona2k
Immortal

I reproduced the results you presented. Short of a fix in the DNS proxy, you can reconfigure VMware's dhcpd to hand out an upstream DNS server or, if you roam, a public DNS provider such as OpenDNS or Google DNS.

This is the section I modified in /Library/Application Support/VMware Fusion/vmnet8:

subnet 172.16.208.0 netmask 255.255.255.0 {
	range 172.16.208.128 172.16.208.254;
	option broadcast-address 172.16.208.255;
	option domain-name-servers 208.67.222.222;
	option domain-name localdomain;
	default-lease-time 1800;                # default is 30 minutes
	max-lease-time 7200;                    # default is 2 hours
	option routers 172.16.208.2;
}

For the change to take effect, I restarted Fusion's NAT services with vmnet-cli --stop and --start, run as root in /Library/Application Support/VMware Fusion. After I renewed my DHCP client lease, /etc/resolv.conf listed the reconfigured DNS server and dig reported the correct results.

petr
VMware Employee

Can you verify that adding

[dns]
prohibitHostLookup = 1

to "/Library/Application Support/VMware Fusion/vmnet8/nat.conf" & restarting natd fixes things for your setup?

Message was edited by petr to add around so it is not link somewhere...

EugeneKim
Contributor

Yes, adding those lines mostly solves the problem. The only remaining bug is the TTL rewrite (it is fixed at 5 seconds for some reason); although it does violate the DNS standards, I don't think it will cause serious problems in practice.

Thank you,

Eugene

petr
VMware Employee

Are you sure it happens with prohibitHostLookup set? A TTL of 5 seconds is used only when natd invents the reply altogether: either when prohibitHostLookup is not set and the request contained the ".localdomain" suffix, or when prohibitHostLookup is not set and the response received had a zero ancount. If prohibitHostLookup is set, you should get exactly the record that the host's res_nsend() returns for the request you sent from the guest.
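
Read literally, the rule described in this post can be sketched as a small predicate. This is only a paraphrase of the description above, not VMware's natd code; the function names and the NATD_INVENTED_TTL constant are hypothetical, and the suffix match is deliberately simplistic (case-sensitive, no trailing dot).

```c
#include <assert.h>
#include <string.h>

/* TTL that, per the description above, natd puts on replies it invents. */
#define NATD_INVENTED_TTL 5

/* Return 1 if s ends with suffix, 0 otherwise. */
static int
ends_with(const char *s, const char *suffix)
{
    size_t slen = strlen(s), suflen = strlen(suffix);

    return slen >= suflen && strcmp(s + slen - suflen, suffix) == 0;
}

/*
 * Return 1 if, under the described rule, natd would invent the reply
 * itself instead of passing the upstream response through unchanged.
 */
int
natd_invents_reply(int prohibit_host_lookup, const char *qname,
    unsigned upstream_ancount)
{
    assert(qname != NULL);
    if (prohibit_host_lookup)
        return 0;       /* pass-through: no invented replies */
    return ends_with(qname, ".localdomain") || upstream_ancount == 0;
}
```

Under this reading, a query for a type that does not exist (upstream ancount of zero) gets an invented reply unless prohibitHostLookup is set, which is consistent with the SRV example earlier in the thread.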

EugeneKim
Contributor

Yes, it does happen with prohibitHostLookup set. In fact, the TTL seems to be rewritten in all the records returned by VMware's proxy (192.168.240.2):

$ dig @192.168.240.2 purple.the-7.net IN SRV # to make sure prohibitHostLookup is working; purple.the-7.net has no SRV records

; <<>> DiG 9.6.1-P1 <<>> @192.168.240.2 purple.the-7.net IN SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27180
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;purple.the-7.net.		IN	SRV

;; AUTHORITY SECTION:
the-7.net.		5	IN	SOA	ns1.the-7.net. hostmaster.the-7.net. 2006090954 10800 3600 604800 60

;; Query time: 3 msec
;; SERVER: 192.168.240.2#53(192.168.240.2)
;; WHEN: Wed Jan 13 23:40:08 2010
;; MSG SIZE  rcvd: 85

$ dig @192.168.240.2 google.com IN A

; <<>> DiG 9.6.1-P1 <<>> @192.168.240.2 google.com IN A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55425
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		5	IN	A	74.125.19.103
google.com.		5	IN	A	74.125.19.104
google.com.		5	IN	A	74.125.19.105
google.com.		5	IN	A	74.125.19.106
google.com.		5	IN	A	74.125.19.147
google.com.		5	IN	A	74.125.19.99

;; AUTHORITY SECTION:
google.com.		5	IN	NS	ns1.google.com.
google.com.		5	IN	NS	ns3.google.com.
google.com.		5	IN	NS	ns4.google.com.
google.com.		5	IN	NS	ns2.google.com.

;; Query time: 4 msec
;; SERVER: 192.168.240.2#53(192.168.240.2)
;; WHEN: Wed Jan 13 23:40:13 2010
;; MSG SIZE  rcvd: 196

$ dig @192.168.240.2 google.com IN A

; <<>> DiG 9.6.1-P1 <<>> @192.168.240.2 google.com IN A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45467
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		5	IN	A	74.125.19.99
google.com.		5	IN	A	74.125.19.103
google.com.		5	IN	A	74.125.19.104
google.com.		5	IN	A	74.125.19.105
google.com.		5	IN	A	74.125.19.106
google.com.		5	IN	A	74.125.19.147

;; AUTHORITY SECTION:
google.com.		5	IN	NS	ns2.google.com.
google.com.		5	IN	NS	ns3.google.com.
google.com.		5	IN	NS	ns4.google.com.
google.com.		5	IN	NS	ns1.google.com.

;; Query time: 3 msec
;; SERVER: 192.168.240.2#53(192.168.240.2)
;; WHEN: Wed Jan 13 23:40:15 2010
;; MSG SIZE  rcvd: 196

$ dig @10.0.0.1 google.com IN A

; <<>> DiG 9.6.1-P1 <<>> @10.0.0.1 google.com IN A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62337
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		264	IN	A	74.125.19.147
google.com.		264	IN	A	74.125.19.99
google.com.		264	IN	A	74.125.19.103
google.com.		264	IN	A	74.125.19.104
google.com.		264	IN	A	74.125.19.105
google.com.		264	IN	A	74.125.19.106

;; AUTHORITY SECTION:
google.com.		155100	IN	NS	ns1.google.com.
google.com.		155100	IN	NS	ns3.google.com.
google.com.		155100	IN	NS	ns4.google.com.
google.com.		155100	IN	NS	ns2.google.com.

;; Query time: 2 msec
;; SERVER: 10.0.0.1#53(10.0.0.1)
;; WHEN: Wed Jan 13 23:40:23 2010
;; MSG SIZE  rcvd: 196

$ dig @10.0.0.1 google.com IN A

; <<>> DiG 9.6.1-P1 <<>> @10.0.0.1 google.com IN A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12864
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		262	IN	A	74.125.19.106
google.com.		262	IN	A	74.125.19.147
google.com.		262	IN	A	74.125.19.99
google.com.		262	IN	A	74.125.19.103
google.com.		262	IN	A	74.125.19.104
google.com.		262	IN	A	74.125.19.105

;; AUTHORITY SECTION:
google.com.		155098	IN	NS	ns3.google.com.
google.com.		155098	IN	NS	ns1.google.com.
google.com.		155098	IN	NS	ns2.google.com.
google.com.		155098	IN	NS	ns4.google.com.

;; Query time: 4 msec
;; SERVER: 10.0.0.1#53(10.0.0.1)
;; WHEN: Wed Jan 13 23:40:25 2010
;; MSG SIZE  rcvd: 196

$ 

As shown above, the TTL is fixed at 5 seconds in all the results from VMware's proxy, whereas in the results from the upstream DNS server it counts down in real time, as expected.

On your point about VMware Fusion's verbatim use of the result returned by res_nsend(): I wrote a simple program to check if Mac OS X's implementation of res_nsend() is the culprit that rewrites TTL, but res_nsend() seems just fine:

$ cat res_send_test.c
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <err.h>
#include <resolv.h>
#include <stdio.h>
#include <string.h>
#include <sysexits.h>

int
main
(int argc, char **argv, char **envp)
{
    unsigned char query[1024], reply[1024];
    int query_len, reply_len;
    struct __res_state res0;
    res_state res = &res0;

    if (argc != 2)
	errx(EX_USAGE, "usage: res_send_test <domain name>");

    memset(res, 0, sizeof(*res));

    res_ninit(res);
    res->options |= RES_DEBUG; /* this parse-prints result received from server */

    query_len = res_mkquery(ns_o_query, /* op */
			    argv[1], ns_c_in, ns_t_a, /* dname, class, type */
			    NULL, 0, NULL, /* data, datalen, newrr_in */
			    query, sizeof(query) /* buf, buflen */);
    if (query_len == -1)
	errx(EX_UNAVAILABLE, "res_mkquery() failed");

    reply_len = res_nsend(res, query, query_len, reply, sizeof(reply));
    if (reply_len == -1)
	errx(EX_UNAVAILABLE, "res_nsend() failed");

    return 0;
}
$ cc -g -O0 -Wall -Werror res_send_test.c -o res_send_test -lresolv
$ ./res_send_test google.com
;; res_send()
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48117
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;;	google.com, type = A, class = IN
;; Querying server (# 1) address = 10.0.0.1
;; new DG socket
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48117
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 4, ADDITIONAL: 0
;;	google.com, type = A, class = IN
google.com.		3m23s IN A	74.125.19.104
google.com.		3m23s IN A	74.125.19.105
google.com.		3m23s IN A	74.125.19.106
google.com.		3m23s IN A	74.125.19.147
google.com.		3m23s IN A	74.125.19.99
google.com.		3m23s IN A	74.125.19.103
google.com.		1d18h16m14s IN NS  ns4.google.com.
google.com.		1d18h16m14s IN NS  ns3.google.com.
google.com.		1d18h16m14s IN NS  ns1.google.com.
google.com.		1d18h16m14s IN NS  ns2.google.com.
$ 

Hope this helps,

Eugene
