Hello,
I just deployed vSAN 7 U2 File Services in my lab, including the AD configuration with a custom OU (instead of the default "Computers" container), and everything works fine. I entered the static DNS records beforehand and checked that both forward and reverse records are correct (no typos).
File Services is not configured to update DNS itself, as the configured AD user (a dedicated service account) has no rights to do so.
Everything works: NFS shares, SMB shares, managing and accessing the SMB shares, the whole AD integration, all without any problems whatsoever.
Still, the DNS error "One or more DNS server is not reachable or File server IP and FQDN not matching with DNS entries" stubbornly appears in Skyline Health. Everything else is green; only the "DNS Lookup" check is in an error state on all deployed File Services nodes under "File Server Health" (the "Infrastructure Health" and "Share Health" checks are all green).
So I went over everything again. I SSH'ed into the VCSA, ran nslookup, forced it to use each of the configured DNS servers one by one, and did forward and reverse tests for all File Services nodes against all DNS servers. All good: no DNS replication issues, all DNS servers reachable, everything resolvable forward and back, no typos.
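For anyone who wants to repeat that sweep, a minimal sketch of the forward/reverse checks (the node FQDNs and DNS server IPs below are placeholders, not values from this environment):

```shell
# Forward- and reverse-resolve every File Services node against every
# configured DNS server. Hostnames and server IPs are placeholders.
nodes="vSANFS-01.mycompany.com vSANFS-02.mycompany.com"
servers="192.168.10.10 192.168.10.11"
for srv in $servers; do
  for fqdn in $nodes; do
    # Forward lookup; skip the "Address: x.x.x.x#53" line of the server itself.
    ip=$(nslookup "$fqdn" "$srv" 2>/dev/null | awk '/^Address/ && !/#/ {print $2; exit}')
    echo "$srv: $fqdn -> ${ip:-LOOKUP FAILED}"
    # Reverse lookup of whatever address came back, if any.
    if [ -n "$ip" ]; then
      nslookup "$ip" "$srv" 2>/dev/null | grep -i 'name'
    fi
  done
done
```

Any FQDN that fails, or any reverse lookup that returns a different name, would explain the health-check complaint.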
All components (the ESXi servers, vCenter, the DNS servers and the vSAN File Server VMs) are in the same subnet, so there are no firewall issues (which I checked anyway, just for good measure).
So what is Skyline health complaining about?
The only thing I noticed is that the VCSA does not do short-name lookups. It can only do FQDN lookups because the VCSA itself has no search domains (/etc/resolv.conf is devoid of such entries). For example, it cannot resolve "vSANFS-01" at all, but it can resolve "vSANFS-01.mycompany.com" forward and back. But as everything is always configured with FQDNs (I never use short names), this is unlikely to be the cause of the health-check error.
Is there some detailed error-log that I can dive into to find out why Skyline health keeps complaining about those DNS errors?
The vSAN health log is located on the vCenter Server at
/var/log/vmware/vsan-health/vmware-vsan-health-service.log
Look for the entry dnsLookupHealth.
The VCSA uses a DNS caching service called dnsmasq; it may be worth restarting it:
# systemctl status dnsmasq
You could also add your search domain to /etc/resolv.conf as a troubleshooting step.
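For illustration, a /etc/resolv.conf with a search domain would look roughly like this (the domain and nameserver addresses are placeholders, not values from this thread):

```
# /etc/resolv.conf -- illustrative values only
search mycompany.com
nameserver 192.168.10.10
nameserver 192.168.10.11
```

With the search line present, a short name like "vSANFS-01" gets the search domain appended automatically during lookups.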
I just had the same thing. I thought it was a DNS issue; however, after several hours it turned out that several of the file share virtual objects had disappeared.
Cluster -> Monitor -> vSAN -> Virtual Objects
I had 2 shares with an object status of "--", which was causing the file service to be in a continuous state of failure. See attached picture.
Hi Paudieo,
Added the search domain. nslookup can now resolve short names, but it did not help. Neither did restarting dnsmasq.
The file "vmware-vsan-health-service.log" does not contain the word "dns" at all.
But the file "vmware-vsan-health-summary-result.log" does. Here is the output for the 4 File Services VMs (these four lines are repeated many times, but they are all identical):
root@vcenter01 [ /var/log/vmware/vsan-health ]# cat vmware-vsan-health-service.log | grep -i dns
root@vcenter01 [ /var/log/vmware/vsan-health ]# cat vmware-vsan-health-summary-result.log | grep -i dns
FileServerConnectivity: Domain IpAddress Host Network DnsLookup ActiveDirectory Description
(anonymous.local, 192.168.10.94, Host-265, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.92, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.),
(anonymous.local, 192.168.10.93, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.91, Host-204, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.),
FileServerConnectivity: Domain IpAddress Host Network DnsLookup ActiveDirectory Description
(anonymous.local, 192.168.10.91, Host-204, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.94, Host-265, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.),
(anonymous.local, 192.168.10.92, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.93, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.),
Not really helpful...
If I dissect that log output a bit more, I see this:
Line 1 contains "192.168.10.94, Host-265" and "192.168.10.92, Host-201"
Line 2 contains "192.168.10.93, Host-201" and "192.168.10.91, Host-204"
Line 3 contains "192.168.10.91, Host-204" and "192.168.10.94, Host-265"
Line 4 contains "192.168.10.92, Host-201" and "192.168.10.93, Host-201"
So:
FS VM with IP 192.168.10.91 is "Host-204"
FS VM with IP 192.168.10.92 is "Host-201"
FS VM with IP 192.168.10.93 is "Host-201"
FS VM with IP 192.168.10.94 is "Host-265"
????
Host-201 is listed twice, with different IPs.
Duhhhhh, something ain't right...
Is it possible that the problem lies within those File Services VMs themselves and not at all in the VCSA? How can I test that?
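Incidentally, the manual dissection above can be scripted. A rough sketch that pulls the (IP, Host) pairs out of the summary-log excerpt and counts how often each host appears (the sample data is copied from the log output above; the scratch-file path is arbitrary):

```shell
# Store the four-entry excerpt in a scratch file, then extract the
# (IP, Host) pairs and count duplicate hosts.
cat <<'EOF' > /tmp/vsan-dns-excerpt.log
(anonymous.local, 192.168.10.94, Host-265, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.92, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.),
(anonymous.local, 192.168.10.93, Host-201, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.), (anonymous.local, 192.168.10.91, Host-204, Green, Red, Green, OneOrMoreDnsServerIsNotReachableOrFileServerIpAndFqdnNotMatchingWithDnsEntries.),
EOF
# IP -> Host mapping, one pair per line
grep -oE '192\.168\.10\.[0-9]+, Host-[0-9]+' /tmp/vsan-dns-excerpt.log | sort
# Count how often each host appears; anything above 1 is suspicious
grep -oE 'Host-[0-9]+' /tmp/vsan-dns-excerpt.log | sort | uniq -c
```

The second grep makes the duplicate stand out immediately: Host-201 shows a count of 2.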
"I just had the same thing. I thought it was a DNS issue. However, after several hours it turned out that several of the File share virtual objects had disappeared.
Cluster -> Monitor -> vSAN -> virtual objects
I had 2 shares with an object status of "--" which was causing the file service to be in continuous state of failure."
Was it the exact same error as I'm having?
I have no missing objects. Everything works 100% correctly. It's starting to smell like a bug in the health check.
Is there a way to test an FS VM (and its view of DNS) by logging into it and running things like nslookup inside it? As all 4 FS VMs show the same error, they could all be misconfigured somehow.
We are experiencing the same problem. The configuration had no errors while running on 7.0 U1d; however, as soon as we updated our environment to 7.0 U2, we started seeing the File Server DNS error. All our SMB and NFS shares are working as expected, and we are able to resolve the FQDNs that were assigned to the vSANFS nodes. I think you may be correct in stating that this appears to be a bug in the health service.
You can log in to the FS VM, but the IP config, DNS, etc. are not assigned to the VM itself; they are assigned to the protocol stack container instead.
Are you running it in production, @V2Classic or @srodenburg? Please do a log dump, file an SR, and post the SR number here.
I just opened an SR (21217034904) and uploaded the logs. Let's see what comes of it.
Thanks. @paudieo is also trying to repro it in his lab.
I reproduced similar behaviour, but not precisely the same as what was reported here.
I reproduced it on upgrade to ESXi 7.0 U2a.
As reported here, my DNS appears correct; I also deleted and re-created my DNS records (it is DNS, after all 😀).
However, during troubleshooting I moved some hosts in and out of the vSAN cluster, and the issue "appeared" to resolve itself.
I have let the engineering team know.
Thanks for filing the SR, by the way.
It's the lab at my company, so not important. As V2Classic has the same issue, I guess we can wait for his support case. I'll be available if support wants to take a peek at the lab, should they want another environment to snoop around in.
Did you check if DNS is reachable from the file service subnet?
Make sure that the file service can resolve the IP addresses to the hostnames and vice versa.
This lab lives entirely within a single /24 subnet.
And as I stated, everything resolves properly back and forth. File Services has no problems interacting with Active Directory or anything else.
Any news on this case from V2Classic?
Not as of yet. Support is still investigating; the issue has been sent to the engineers for further investigation.
Ok. If they have trouble reproducing the issue and would like to poke around in a second affected environment, they can gladly contact me. It might speed things up. Plus, it's a lab and not production, so they can play around without causing production outages or other things that upset management 🙂
Any news?
I have uploaded the host logs and they are currently being reviewed. We have a scheduled call for next week to see if we can determine how to address it.
The issue is still being investigated. In the meantime I have managed to get 4 out of the 8 appliances to respond. We ran the following command in the shell of each appliance:
docker ps
The command showed the Docker container ID. In one case, 1 appliance was showing 2 IDs.
We then executed the following command:
docker stop "Container ID"
This moved the container to another appliance. We repeated these steps until all appliances showed only one container ID running. We then looked at vSAN Health and noticed that 4 of the 8 systems were now reporting correctly.
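The procedure above can be sketched roughly like this. It's an assumption-laden sketch, not an official procedure: it guards against docker being absent, and "stop the last-listed container" is an arbitrary choice; which duplicate to stop is a judgment call.

```shell
# On one appliance: count the running containers; if more than one
# protocol-stack container is running, stop an extra one so vSAN can
# bring it up on another appliance.
if command -v docker >/dev/null 2>&1; then
  count=$(docker ps -q 2>/dev/null | wc -l)
  echo "containers running on this appliance: $count"
  if [ "$count" -gt 1 ]; then
    extra=$(docker ps -q | tail -n 1)
    echo "stopping duplicate container $extra"
    docker stop "$extra"
  fi
else
  count=0
  echo "docker is not available here; run this on the appliance shell"
fi
```

Repeat per appliance until `docker ps` shows exactly one container everywhere, as described above.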