VMware Cloud Community
srodenburg
Expert

vSAN 7 U2 File Services - File Server DNS errors in Skyline Health

Hello,

I just deployed vSAN 7 U2 File Services in my lab, including the AD config and a custom OU (instead of the "Computers" container), and everything works fine. I entered the static DNS records into DNS beforehand and checked that both forward and reverse records are all correct (no typos).

File Services is not configured to update DNS itself, as the configured AD user (a dedicated service account) has no rights to do so.

Everything works fine: NFS shares and SMB shares alike. The SMB shares can be managed and accessed, the whole AD integration works, and so on, all without any problems whatsoever.

Still, the DNS error "One or more DNS server is not reachable or File server IP and FQDN not matching with DNS entries" stubbornly appears in Skyline Health. Under "File Server Health", the "DNS Lookup" check is in an error state on all deployed File Services nodes; everything else is green, including the "Infrastructure Health" and "Share Health" checks.

Naturally, I went over everything again. I SSH'ed to the VCSA, ran nslookup, forced it to use each of the DNS servers I configured one by one, and did forward and reverse tests for all File Services nodes against all DNS servers. All good. No DNS replication issues. All DNS servers are reachable, and everything resolves forward and back 100%, no typos.
All components (the ESXi servers, vCenter, the DNS servers and the vSAN File Server VMs) are in the same subnet, so there are no firewall issues (which I checked anyway for good measure).
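
The per-server forward and reverse checks described above could be scripted roughly as follows. This is only a minimal sketch: it assumes `dig` is available on the machine running it, and the function names as well as the values in the usage comment are hypothetical, not taken from this lab.

```shell
#!/bin/sh
# Sketch: forward and reverse DNS consistency check for one name on one server.
# Function names and the example values in the usage comment are hypothetical.

# Normalise a DNS name: lower-case it and drop a trailing dot.
normalize_name() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
}

# Pure comparison: succeed only if the forward answer contains the expected IP
# and the reverse answer contains the expected FQDN (case-insensitive).
# Args: expected_fqdn expected_ip forward_answer reverse_answer
records_match() {
  want_name=$(normalize_name "$1")
  printf '%s\n' "$3" | grep -qxF "$2" || return 1
  for ptr in $4; do
    [ "$(normalize_name "$ptr")" = "$want_name" ] && return 0
  done
  return 1
}

# Query one DNS server with dig and compare both directions.
# Args: fqdn ip dns_server
check_on_server() {
  fwd=$(dig +short "@$3" "$1" A)
  rev=$(dig +short "@$3" -x "$2")
  if records_match "$1" "$2" "$fwd" "$rev"; then
    echo "OK   $1 <-> $2 via $3"
  else
    echo "FAIL $1 <-> $2 via $3 (forward: $fwd, reverse: $rev)"
  fi
}

# Example (hypothetical values):
#   check_on_server vsanfs-01.mycompany.com 192.168.10.91 192.168.10.10
```

Running `check_on_server` once per file server and per configured DNS server gives one OK/FAIL line per combination. It only shows what the DNS servers answer; it does not show what the health check itself compares.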

So what is Skyline Health complaining about?

The only thing I noticed is that the VCSA does not do short-name lookups. It can only do FQDN lookups because the VCSA itself has no search domains (/etc/resolv.conf is devoid of such entries). So, for example, it cannot resolve "vSANFS-01" at all, but it can resolve "vSANFS-01.mycompany.com" forward and back. But as everything is always configured with FQDNs (I never use short names), this is unlikely to be the cause of the health-check error.
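
To make the short-name behaviour concrete, here is a small sketch that lists the search domains in a resolv.conf-style file and shows which FQDNs a short name would be expanded to. The function names are made up for illustration, and the expansion is simplified: a real resolver also honours options such as `ndots`.

```shell
#!/bin/sh
# Sketch: show search domains and the FQDN candidates for a short name.
# Function names are hypothetical; resolver options such as ndots are ignored.

# Print each search/domain entry of a resolv.conf-style file, one per line.
# Args: [resolv_conf_path] (defaults to /etc/resolv.conf)
search_domains() {
  awk '$1 == "search" || $1 == "domain" { for (i = 2; i <= NF; i++) print $i }' \
    "${1:-/etc/resolv.conf}"
}

# Print the FQDNs that would be tried for a name.
# Args: name [resolv_conf_path]
expand_short_name() {
  case "$1" in
    *.*) printf '%s\n' "$1"; return 0 ;;  # already dotted: printed as-is (simplification)
  esac
  search_domains "$2" | while read -r domain; do
    printf '%s.%s\n' "$1" "$domain"
  done
}

# Example:
#   expand_short_name vSANFS-01
```

With no search entries, `expand_short_name` prints nothing for a short name, which matches the observation that "vSANFS-01" cannot be resolved while the FQDN can.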

Is there some detailed error-log that I can dive into to find out why Skyline health keeps complaining about those DNS errors?

24 Replies
paudieo
VMware Employee

The vSAN health log is located on the vCenter Server at

/var/log/vmware/vsan-health/vmware-vsan-health-service.log
Look for the entry dnsLookupHealth.

The vCSA uses a DNS cache called dnsmasq; it may be worth restarting that service.

# systemctl status dnsmasq
You could also add your search domain to /etc/resolv.conf as a troubleshooting step.
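
If the expected entry is not in that particular file, a quick way to see which logs in the directory mention a pattern at all is sketched below. The function name is hypothetical, and rotated or compressed logs are not covered.

```shell
#!/bin/sh
# Sketch: list the *.log files in a directory that mention a pattern.
# The function name is hypothetical; compressed or rotated logs are skipped.

# Args: pattern [log_dir] (defaults to the vSAN health log directory)
logs_mentioning() {
  grep -ilE -- "$1" "${2:-/var/log/vmware/vsan-health}"/*.log 2>/dev/null
}

# Example:
#   logs_mentioning 'dnsLookupHealth|dnslookup'
```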

 

hughneale
Contributor

I just had the same thing. I thought it was a DNS issue. However, after several hours it turned out that several of the File share virtual objects had disappeared.

Cluster -> Monitor -> vSAN -> virtual objects

I had 2 shares with an object status of "--", which was causing the file service to be in a continuous state of failure. See attached picture.

 

srodenburg
Expert

Hi Paudieo,

Added the search domain. nslookup can now resolve short names. But it did not help. Neither did restarting dnsmasq.

The file "vmware-vsan-health-service.log" does not contain the word "dns" at all.
But the file "vmware-vsan-health-summary-result.log" does. Here is the output for the 4 File Services VMs (these four lines are repeated many times, but they are all identical):

 

 

root@vcenter01 [ /var/log/vmware/vsan-health ]# cat vmware-vsan-health-service.log | grep -i dns

root@vcenter01 [ /var/log/vmware/vsan-health ]# cat vmware-vsan-health-summary-result.log | grep -i dns

         FileServerConnectivity: Domain  IpAddress  Host  Network  DnsLookup  ActiveDirectory  Description
                                 (anonymous.local, 192.168.10.94, Host-265, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.92, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), 
                                 (anonymous.local, 192.168.10.93, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.91, Host-204, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), 
         FileServerConnectivity: Domain  IpAddress  Host  Network  DnsLookup  ActiveDirectory  Description
                                 (anonymous.local, 192.168.10.91, Host-204, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.94, Host-265, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), 
                                 (anonymous.local, 192.168.10.92, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.93, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.),
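
The tuples in that output are hard to read by eye. A rough sketch like the following reduces them to one unique line per file server, assuming the field order shown in the header line (Domain, IpAddress, Host, Network, DnsLookup, ActiveDirectory, Description). The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: reduce FileServerConnectivity tuples to "ip host dnslookup" lines.
# Assumes the 7-field tuple layout shown in the log header; the function
# name is hypothetical.

# Args: path to the summary-result log
fs_dns_status() {
  grep -o '([^()]*)' "$1" |
    tr -d '()' |
    awk -F', *' 'NF >= 7 && $2 ~ /^[0-9.]+$/ { print $2, $3, $5 }' |
    sort -u
}

# Example:
#   fs_dns_status /var/log/vmware/vsan-health/vmware-vsan-health-summary-result.log
```

Because of `sort -u`, the many repeated blocks collapse to a single line per IP address, host and DnsLookup state.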

 

 

Not really helpful...
If I dissect that log file a bit more, I see this:

Line 1 contains: "192.168.10.94, Host-265" and "192.168.10.92, Host-201"
Line 2 contains: "192.168.10.93, Host-201" and "192.168.10.91, Host-204"
Line 3 contains: "192.168.10.91, Host-204" and "192.168.10.94, Host-265"
Line 4 contains: "192.168.10.92, Host-201" and "192.168.10.93, Host-201"

So:
FS VM with IP 192.168.10.91 is "Host-204"
FS VM with IP 192.168.10.92 is "Host-201"
FS VM with IP 192.168.10.93 is "Host-201"
FS VM with IP 192.168.10.94 is "Host-265"
????
Host-201 is listed twice with different IPs.
Duhhhhh something ain't right...
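
The same grouping can be done mechanically. The sketch below reads "IP host" pairs on standard input and prints every host that carries more than one file server IP. The function name is hypothetical, and whether a doubled-up host has anything to do with the DNS error is not confirmed at this point.

```shell
#!/bin/sh
# Sketch: find hosts that appear with more than one file server IP.
# Input on stdin: one "ip host" pair per line. Function name is hypothetical.

hosts_with_multiple_ips() {
  awk '
    !seen[$2 SUBSEP $1]++ {
      count[$2]++
      ips[$2] = ips[$2] (ips[$2] == "" ? "" : ",") $1
    }
    END {
      for (h in count) if (count[h] > 1) print h, count[h], ips[h]
    }
  ' | sort
}

# Example with the pairs listed above:
#   printf '%s\n' '192.168.10.91 Host-204' '192.168.10.92 Host-201' \
#                 '192.168.10.93 Host-201' '192.168.10.94 Host-265' |
#     hosts_with_multiple_ips
```

Duplicate pairs are counted once, so repeated log blocks do not inflate the numbers.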


Is it possible that the problem lies within those File Services VMs themselves and not at all in the VCSA? How can I test that?

 

srodenburg
Expert

"I just had the same thing. I thought it was a DNS issue. However, after several hours it turned out that several of the File share virtual objects had disappeared.
Cluster -> Monitor -> vSAN -> virtual objects
I had 2 shares with an object status of "--" which was causing the file service to be in continuous state of failure."

Was it the exact same error as I'm having?
I have no missing objects. Everything works 100% correctly. It starts to smell like a bug in the health check.

Is there a way to test an FS VM (and its perspective of DNS) by logging into it and running something like nslookup inside it? As all 4 FS VMs show the same error, they could be misconfigured somehow.

V2Classic
Contributor

We are experiencing the same problem. The configuration had no errors while running on 7.0 U1d; however, as soon as we updated our environment to 7.0 U2, we started seeing the File Server DNS error. All our SMB and NFS shares are working as expected. We are able to resolve the FQDNs that were assigned to the vSANFS nodes. I think you may be correct in stating that this appears to be a bug in the health service.

depping
Leadership

You can log in to the FS VM, but the IP config, DNS, etc. are not assigned to the VM itself; they are assigned to the protocol stack container.

depping
Leadership

Are you running it in production, @V2Classic or @srodenburg? Please do a log dump, file an SR, and post the SR number here.

V2Classic
Contributor

I just opened an SR (21217034904) and uploaded the logs. Let's see what comes of it.

depping
Leadership

Thanks. @paudieo is also trying to repro it in his lab.

paudieo
VMware Employee

I reproduced similar behaviour, though not precisely the same as what was reported here.
I reproduced it on upgrade to ESXi 7.0 U2a.
As reported here, my DNS appears correct. I also deleted and re-created my DNS records (it is DNS after all 😀).

However, during troubleshooting I moved some hosts in and out of the vSAN cluster, and the issue "appeared" to resolve itself.

I have let the eng team know.

Thanks for filing the SR, by the way.

 

srodenburg
Expert

It's the lab in my company, so it's not important. As V2Classic has the same issue, I guess we can wait for his support case. I'll be available if support wants to take a peek in the lab, should they want another environment to snoop around in.

anderel
Contributor

Did you check if DNS is reachable from the file service subnet?

Make sure that the file service can resolve the IP addresses to the hostnames and vice versa.

srodenburg
Expert

This lab lives completely in the same /24 subnet.
And as I stated, everything resolves properly back and forth. File Services has no problems interacting with Active Directory or anything else.

srodenburg
Expert

Any news on this case from V2Classic?

V2Classic
Contributor

Not as of yet.  Support is still investigating.  The issue has been sent to the engineers for further investigation.

srodenburg
Expert

OK. If they have problems reproducing the issue and would like to peek around in a second environment that is also affected, they are welcome to contact me. That might speed things up. Plus, it's a lab and not production, so they can play around without causing production outages or other things that upset management 🙂

srodenburg
Expert

Any news?

V2Classic
Contributor

I have uploaded the host logs and they are currently being reviewed.  We have a scheduled call for next week to see if we can determine how to address it.

V2Classic
Contributor

The issue is still being investigated. In the meantime I have managed to get 4 out of the 8 appliances to respond. We ran the following command in the shell of each appliance:

docker ps

The command showed the Docker container ID. In one case, 1 appliance showed 2 IDs.

We then executed the following command:

docker stop "Container ID"

This moved the docker ID to another appliance.  We repeated the steps until all appliances showed only one container ID running. We then looked at the vSAN Health and noticed that 4 of the 8 systems were now reporting correctly. 
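
The check part of this procedure could be sketched as below, to be run in the shell of each appliance. It assumes that exactly one running container per appliance is the desired state, which is inferred from the steps above and not from any documentation. The function names are hypothetical, and the sketch deliberately does not stop anything; `docker stop` stays a manual step.

```shell
#!/bin/sh
# Sketch: report how many containers run on this appliance.
# Assumption: one container per appliance is the desired state.
# Function names are hypothetical; nothing is stopped automatically.

# Count non-empty lines on stdin (one container ID per line from `docker ps -q`).
count_container_ids() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Print a one-line verdict, plus the container list if the count is not 1.
report_containers() {
  n=$(docker ps -q | count_container_ids)
  if [ "$n" -eq 1 ]; then
    echo "ok: 1 container running"
  else
    echo "check: $n containers running; expected 1"
    docker ps --format '{{.ID}}  {{.Names}}'
  fi
}

# Example:
#   report_containers
```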
