VMware Cloud Community
sflanders
Commander

vCenter Server / DNS + NFS Datastore Issue

I currently have a VI3 environment with a vCenter Server 2.5 U4 VM that uses AD with AD dynamic DNS (one master, one slave). In addition, I have ESX 3.5 U4 hosts, which are installed using EDA and have dedicated DNS servers (one master, one slave). The ESX hosts connect to NFS datastores by hostname (not IP). Recently, the master DNS server for the ESX hosts had a kernel panic and went offline. During the outage we discovered that when browsing an NFS datastore through the VI Client connected to the vCenter Server instance, only 'Searching Datastore...' appears instead of the folders on the datastore. Connecting the VI Client directly to the ESX host did not run into this problem. Once the master DNS server was brought back online, browsing the datastore through vCenter Server worked again. Taking the slave DNS server down as a test did not cause the issue; however, every time the master DNS server was brought down the issue reappeared.

An ESX host was then reinstalled using EDA, this time with the slave DNS server specified as the primary nameserver. Now, whenever the slave DNS server was brought down the issue appeared, but bringing down the master did not have the same effect. Based on these tests, it was concluded that DNS is not the issue here and that the problem appears to be VMware (specifically ESX) related. Has anyone ever seen this before, or does anyone know where to look? I have already searched the filesystem, but the DNS IP addresses do not show up in many places (e.g. /etc/resolv.conf). At this point I am starting to think the information resides in a binary file somewhere... any help would be greatly appreciated.

Hope this helps! === If you find this information useful, please award points for "correct" or "helpful". ===
10 Replies
sflanders
Commander

To add a little more information, it appears that all DNS resolution from the VC to the ESX hosts fails when the ESX hosts' primary DNS server is unavailable. It appears as though the VC does not even bother to check the secondary DNS server, or is unaware that the primary is down. Again, everything works fine when connected directly to the ESX host... any help would be greatly appreciated.

sflanders
Commander

I have confirmed that installing by hand (i.e. not using EDA) yields the same result: the primary works, but if it goes down, VC does not use the secondary configured on the ESX host... at this point I am running out of ideas.

kjb007
Immortal

What does your /etc/resolv.conf look like on your ESX host?

-KjB

VMware vExpert

Don't forget to leave points for helpful/correct posts.

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
sflanders
Commander

search sc.dc.example.com
nameserver 10.20.30.53
nameserver 10.20.30.54

Note that example.com is maintained by a different master/slave pair (10.2.3.53 and 10.2.3.54), which are configured with:

lookup file bind
nameserver 4.2.2.1
nameserver 4.2.2.2
nameserver 4.2.2.3

Also, the VC uses AD DNS:

search ad.example.com
nameserver 10.40.50.53
nameserver 10.40.50.54

nslookup works fine on both the ESX host and the VC. The problem only occurs when the ESX primary DNS server goes down, and only through the VC.
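Since nslookup goes through its own code path, one way to check each nameserver in isolation during the failed state is to query it directly. The following is a minimal Python sketch (standard library only; nothing in the thread uses it) that builds an A-record query by hand and sends it to one server at a time with a short timeout, so a dead primary shows up as "no reply" rather than as a long stall. The function names are illustrative, not part of any VMware tool.

```python
import socket
import struct
import time
from typing import Optional, Tuple

def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (only RD, "recursion desired", set), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length byte, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)

def probe(nameserver: str, name: str, timeout: float = 2.0) -> Optional[Tuple[int, float]]:
    """Query one nameserver directly; return (rcode, seconds), or None if it did not answer."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (nameserver, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers socket.timeout as well as ICMP-reported errors
            return None
        elapsed = time.monotonic() - start
    if len(reply) < 12:  # shorter than a DNS header: treat as no usable answer
        return None
    # The low 4 bits of the flags word are the RCODE (0 = NOERROR, 3 = NXDOMAIN)
    _, flags = struct.unpack("!HH", reply[:4])
    return flags & 0x000F, elapsed
```

For example, running `probe("10.20.30.53", "nfs01.dc.example.com")` and then the same call against 10.20.30.54 while the primary is down should show the primary returning `None` after the timeout while the secondary answers. The datastore name nfs01.dc.example.com is hypothetical.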

kjb007
Immortal

So, is the sc.dc.example.com domain delegated to the nameservers listed in your resolv.conf? Also, are both the VC and the ESX hosts in the same DNS domain? It definitely appears that name resolution is breaking, and with several nameservers in play, a possibly disjointed namespace may be adding to the confusion as well.

sflanders
Commander

Every ESX host is named in the form esx01.sc.dc.example.com and configured with:

search sc.dc.example.com
nameserver 10.20.30.53
nameserver 10.20.30.54

The nameservers specified (10.20.30.53 and 10.20.30.54) are in a static master/slave configuration, and 10.20.30.53 owns the sc.dc.example.com zone. All requests outside of sc.dc.example.com (e.g. dc.example.com) are forwarded (via forwarders) to the domain slave servers (10.2.3.53 and 10.2.3.54). The domain slave servers have a master (10.2.3.52), which owns the example.com and dc.example.com zones (again static). The domain slaves also act as external resolvers (hence their use of 4.2.2.1, 4.2.2.2, and 4.2.2.3). vCenter (vc01.ad.example.com) is part of AD (ad.example.com) and uses Microsoft's dynamic DNS, which is a separate subdomain zone of the domain master (example.com). The master domain server's zone files have entries to reach the subdomains as follows:

= example.com zone file
ad      IN NS ns1.ad.example.com.
        IN NS ns2.ad.example.com.
ns1.ad  IN A  10.40.50.53
ns2.ad  IN A  10.40.50.54

= dc.example.com zone file
sc      IN NS ns1.sc.dc.example.com.
        IN NS ns2.sc.dc.example.com.
ns1.sc  IN A  10.20.30.53
ns2.sc  IN A  10.20.30.54
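With this delegation, each name is answered by the servers of the most specific zone that contains it. The small Python model below sketches that longest-suffix match. It is a simplification: it ignores the forwarding from 10.20.30.53/.54 to the domain slaves, and the server lists for example.com and dc.example.com are assumptions, since their NS records were not posted. It shows that every esx*.sc.dc.example.com name depends solely on the 10.20.30.53/.54 pair.

```python
from typing import Dict, List

# Zones named in the thread and the servers that answer for them.
# Assumption: the master 10.2.3.52 and its slaves 10.2.3.53/.54 all serve
# example.com and dc.example.com; the NS records for those zones were not posted.
ZONES: Dict[str, List[str]] = {
    "example.com": ["10.2.3.52", "10.2.3.53", "10.2.3.54"],
    "dc.example.com": ["10.2.3.52", "10.2.3.53", "10.2.3.54"],
    "ad.example.com": ["10.40.50.53", "10.40.50.54"],     # delegated from example.com
    "sc.dc.example.com": ["10.20.30.53", "10.20.30.54"],  # delegated from dc.example.com
}

def closest_zone(name: str) -> str:
    """Return the most specific zone in ZONES that contains `name`."""
    labels = name.rstrip(".").lower().split(".")
    # Drop leading labels one at a time: esx01.sc.dc.example.com -> sc.dc.example.com -> ...
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in ZONES:
            return candidate
    raise LookupError(f"no configured zone contains {name!r}")

def authoritative_servers(name: str) -> List[str]:
    """Servers that can answer authoritatively for `name` under this model."""
    return ZONES[closest_zone(name)]
```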

I know the DNS implementation is very complicated (unfortunately this cannot be changed), and initially I thought DNS was the problem. However, when the primary ESX nameserver goes down, everything works fine from the VI Client connected directly to the ESX host. The problem only exists when using the VI Client connected to the vCenter instance. Even with the primary ESX nameserver down, I am able to run nslookup from the vCenter without issue, which is why I do not think DNS is the problem. In addition, I install ESX by default with 10.20.30.53 as the primary and 10.20.30.54 as the secondary. If I install ESX with 10.20.30.54 as the primary and 10.20.30.53 as the secondary (the exact opposite), then the problem only appears when the slave nameserver (10.20.30.54) goes down. In my mind this proves that the master and slave DNS servers are working properly, which is why I think DNS is fine and the problem is actually VMware related.

kjb007
Immortal

OK, is your NFS server configured as an FQDN or as a short hostname, and which domain does it live in? Have you tried adding all of the domains to your search list on both the Windows box and your ESX hosts? Also, when you have problems, does resolution work for both short and fully qualified domain names? I ask because I have a disjointed namespace and my primary DNS has had problems, but I have not encountered your scenario yet. That said, my DNS infrastructure is not as complicated as yours, and my search string includes all of the domains used by the servers.
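Whether a short name resolves depends on the search list. The Python sketch below is simplified from the behaviour documented in resolv.conf(5), with the default ndots of 1; nfs01 is a hypothetical host name. It shows the order in which candidate names are tried, and why adding dc.example.com to the search list lets a short datastore name resolve.

```python
from typing import List

def candidate_names(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """Order in which a glibc-style stub resolver tries `name` (simplified sketch)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute, so the search list is skipped.
        return [name]
    as_given = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then each search domain.
        return [as_given] + expanded
    # Too few dots: try each search domain first, the bare name last.
    return expanded + [as_given]
```

For example, with a search list of sc.dc.example.com and dc.example.com, `candidate_names("nfs01", ...)` tries nfs01.sc.dc.example.com. first, then nfs01.dc.example.com., then nfs01. on its own. Each candidate that has to be tried is another round of queries, so a longer search list can lengthen the stall when the primary is unreachable.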

sflanders
Commander

NFS datastores are configured with FQDNs, and they reside in the dc.example.com zone. Neither the ESX hosts nor the vCenter instance had all subdomains in the search path (only the subdomain each was part of), hence the use of FQDNs. Per your suggestion, I added all of the search paths, which made short names work; however, this did not solve my problem. With this change, both short names and FQDNs resolve when testing in the failed state.

I have tried a few more tests:

  • add the NFS datastore information into /etc/hosts on the vCenter - this does not help

  • add the NFS datastore information into /etc/hosts on the ESX host - this does solve the problem (new workaround, but not scalable)

  • changed the vCenter DNS to be identical to the ESX DNS - this does not help

  • swapped the primary and secondary nameservers on the ESX host and rebooted - now the problem is experienced when the new primary goes down
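The /etc/hosts workaround on the ESX host probably helps because, with the usual "hosts: files dns" order in /etc/nsswitch.conf, a hit in the hosts file never reaches DNS at all. That order is an assumption about the ESX service console, not something verified in this thread. Below is a Python sketch of that lookup order, using a hypothetical datastore nfs01.dc.example.com at the documentation address 192.0.2.10.

```python
from typing import Callable, Dict

def parse_hosts(text: str) -> Dict[str, str]:
    """Map each hostname and alias in hosts-file text to its address."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        address, *names = line.split()
        for host in names:
            # Keep the first entry for a name, since lookups stop at the first match.
            table.setdefault(host.lower(), address)
    return table

def resolve(name: str, hosts: Dict[str, str], dns_lookup: Callable[[str], str]) -> str:
    """Consult the hosts table first and fall back to DNS only on a miss."""
    key = name.rstrip(".").lower()
    if key in hosts:
        return hosts[key]
    return dns_lookup(name)
```

The same sketch also shows why the workaround does not scale: every datastore name has to be kept up to date by hand in every host's table.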

To be sure this is not a configuration issue, I have tried this on physical and virtual vCenter instances, with and without AD, on Windows Server 2003 SP2 and 2003 R2 SP2, on vCenter Server 2.5 U3 and U4 (only thing left to test is vSphere, but I do not currently have an environment up to test this). All experience the exact same problem. In addition, I have run tcpdump from the ESX DNS servers and I can see the query to the primary, but the query to the secondary is significantly delayed (anywhere from 10 seconds to several minutes). If/When the query to the secondary DNS server is made things work, but the next query results in the exact same behavior.

I really appreciate the help with this complicated and frustrating problem. Any additional feedback/suggestions would be greatly appreciated.

kjb007
Immortal

It looks like the default timeouts are too long. You can set timeouts for nameserver queries in your resolv.conf; try the entries below. timeout sets how many seconds the resolver waits for a response from a nameserver before retrying via a different nameserver, attempts sets how many times the query is sent to the nameservers before the resolver gives up, and rotate round-robins queries between the nameservers. The timeout and attempts values below are the defaults (rotate is not enabled by default), which would explain the 10-second wait before the second nameserver is tried. These changes should also not require a reboot.

options timeout:5
options attempts:2
options rotate
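To confirm what a host's resolver will pick up from those lines, here is a minimal Python sketch of a resolv.conf reader. It simplifies the behaviour described in resolv.conf(5): it handles only nameserver, search/domain and the three options above, applies the documented defaults of timeout:5 and attempts:2, and does not model glibc's upper limits on those values.

```python
from dataclasses import dataclass, field
from typing import List

MAXNS = 3  # glibc only uses the first three nameserver lines

@dataclass
class ResolverConfig:
    nameservers: List[str] = field(default_factory=list)
    search: List[str] = field(default_factory=list)
    timeout: int = 5      # default when no "options timeout:n" is given
    attempts: int = 2     # default when no "options attempts:n" is given
    rotate: bool = False  # round-robin is off unless "options rotate" is present

def parse_resolv_conf(text: str) -> ResolverConfig:
    """Extract nameservers, search list and the timeout/attempts/rotate options."""
    cfg = ResolverConfig()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # comment lines start with '#' or ';'
            continue
        keyword, *args = line.split()
        if keyword == "nameserver" and args and len(cfg.nameservers) < MAXNS:
            cfg.nameservers.append(args[0])
        elif keyword in ("search", "domain"):
            cfg.search = list(args)  # the last search/domain line wins
        elif keyword == "options":
            # Several options may share one line, and several options lines may appear.
            for opt in args:
                if opt.startswith("timeout:"):
                    cfg.timeout = int(opt.split(":", 1)[1])
                elif opt.startswith("attempts:"):
                    cfg.attempts = int(opt.split(":", 1)[1])
                elif opt == "rotate":
                    cfg.rotate = True
    return cfg
```

Unknown keywords such as the BSD-style "lookup file bind" line are simply skipped by this sketch.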

sflanders
Commander

Sorry for not posting sooner; work has been busy and I was unable to test your recommendation until now. I tried it, but unfortunately I got the same result. I even rebooted the ESX host to be sure. I guess it makes sense that this did not work, because connecting directly to the ESX host always worked; browsing only failed through VC when the master DNS server went down. I am trying to set up a vSphere environment to see whether it experiences the same problem, and I will report back once it is up.
