VMware Cloud Community
barnette08
Expert

Multi-Cell NTP Sync

Has anyone run into an issue where their cells fall out of sync over time? I noticed this morning that I couldn't connect to the VMRC console, so I checked NTP on the cells, since this has happened before. The cells were off by 10 seconds from each other. They both connect to the same NTP server, so to fix it I usually stop the ntpd service and force a re-sync to the server. This has happened a few times over the past few months. Does anyone know a way to keep it from happening?

10 Replies
_morpheus_
Expert

What does your ntp.conf look like?

JayhawkEric
Expert

We had the same issue, even though the cells had the same servers set in ntp.conf and the hosts they ran on had the same processors and the same time. We ended up creating a cron job to force a sync every hour, because after 4-5 days the cells would be off by about 10 minutes.
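For a sense of scale, the figures above can be turned into an average drift rate. This is only a rough sketch: it assumes the clock drifted steadily over the whole period (the thread does not confirm that), it takes 4.5 days as the midpoint of "4-5 days", and the helper name is made up for illustration. ntpd slews the clock at no more than 500 ppm, so a drift of this size is more than it can absorb by slewing alone.

```python
# Rough arithmetic only: assumes a steady drift, which the thread does not confirm.
SECONDS_PER_DAY = 86_400
NTPD_MAX_SLEW_PPM = 500  # ntpd's maximum slew rate


def drift_ppm(offset_seconds: float, elapsed_seconds: float) -> float:
    """Average drift in parts per million (hypothetical helper)."""
    return offset_seconds / elapsed_seconds * 1_000_000


# "About 10 minutes" after "4-5 days"; 4.5 days is an assumed midpoint.
rate = drift_ppm(10 * 60, 4.5 * SECONDS_PER_DAY)
print(f"average drift: {rate:.0f} ppm")
print("exceeds ntpd slew limit:", rate > NTPD_MAX_SLEW_PPM)
```

If the real drift is anywhere near this figure, an hourly forced sync hides the symptom but the underlying clock problem is still there.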

-Eric

VCP5-DV twitter - @ericblee6 blog - http://vEric.me
barnette08
Expert

morpheus, I have pasted the ntp.conf file below and just changed my IP to A.B.C.D. Let me know what your thoughts are. Thanks, and happy 4th!

# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).

driftfile /var/lib/ntp/drift

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

# Hosts on local network are less restricted.
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.rhel.pool.ntp.org
#server 1.rhel.pool.ntp.org
#server 2.rhel.pool.ntp.org
server A.B.C.D

#broadcast 192.168.1.255 autokey # broadcast server
#broadcastclient # broadcast client
#broadcast 224.0.1.1 autokey # multicast server
#multicastclient 224.0.1.1 # multicast client
#manycastserver 239.255.254.254 # manycast server
#manycastclient 239.255.254.254 autokey # manycast client

# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
#server 127.127.1.0 # local clock
#fudge 127.127.1.0 stratum 10

# Enable public key cryptography.
#crypto

includefile /etc/ntp/crypto/pw

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys

# Specify the key identifiers which are trusted.
#trustedkey 4 8 42

# Specify the key identifier to use with the ntpdc utility.
#requestkey 8

# Specify the key identifier to use with the ntpq utility.
#controlkey 8

# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats

nirvy
Commander

What's the output of ntpq -p?

barnette08
Expert

Here is the ntpq -p output:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*FQDN-ntp        .GPS.            1 u  982 1024  377    0.684    0.050   0.051
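To check this across several cells without reading the columns by eye, a peer row can be parsed along these lines. This is a minimal sketch, not an official tool: `parse_peer_line` is a hypothetical helper, the sample row is modeled on the output above, and the code assumes the ten-column layout shown there. The reach column is octal, and delay, offset and jitter are in milliseconds.

```python
# Tally characters ntpq prints in the first column; "*" marks the peer
# currently selected for synchronization.
TALLY_CHARS = " *+-#xo."


def parse_peer_line(line: str) -> dict:
    """Parse one peer row of `ntpq -p` output (hypothetical helper).

    Assumes the row still has its first column intact, i.e. leading
    whitespace was not stripped before the row got here.
    """
    if not line.strip():
        raise ValueError("empty line")
    tally, rest = line[0], line[1:]
    if tally not in TALLY_CHARS:
        tally, rest = " ", line
    fields = rest.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    remote, refid, st, _type, _when, poll, reach, delay, offset, jitter = fields
    return {
        "selected": tally == "*",
        "remote": remote,
        "refid": refid,
        "stratum": int(st),
        "poll_s": int(poll),
        "reach": int(reach, 8),  # octal bitmask of the last eight polls
        "delay_ms": float(delay),
        "offset_ms": float(offset),
        "jitter_ms": float(jitter),
    }


sample = "*FQDN-ntp        .GPS.            1 u  982 1024  377    0.684    0.050   0.051"
peer = parse_peer_line(sample)
print(peer["remote"], "offset (ms):", peer["offset_ms"])
print("all of the last 8 polls answered:", peer["reach"] == 0o377)
```

A reach of 377 means the last eight polls were all answered, and an offset of 0.050 ms means the cell agreed with its server at the moment of the query.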
nirvy
Commander

There doesn't seem to be anything wrong with your NTP config or with the configured peer, so maybe the local clock in one or more of the cells is occasionally drifting too far too quickly and going beyond the range where NTP can correct it.


Adding the following to (the top of) your cells' ntp.conf might help:

tinker panic 0

This setting disables ntpd's panic threshold (1000 seconds by default), so ntpd accepts any offset instead of exiting. It is considered a VMware timekeeping best practice for Linux guests: http://kb.vmware.com/kb/1006427
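If the change has to go onto several cells, the edit can be scripted. The following is a minimal sketch under stated assumptions: `ensure_tinker_panic` is a hypothetical helper that works on the file's text only, and reading or writing the real ntp.conf and restarting ntpd are left out on purpose.

```python
TINKER_LINE = "tinker panic 0"


def ensure_tinker_panic(conf_text: str) -> str:
    """Return conf_text with `tinker panic 0` as its first line (hypothetical helper).

    Only an active line that is exactly `tinker panic 0` (ignoring spacing
    and a trailing comment) counts as already present; it is moved to the top.
    """
    kept = [
        line
        for line in conf_text.splitlines()
        if line.split("#", 1)[0].split() != TINKER_LINE.split()
    ]
    return "\n".join([TINKER_LINE, *kept]) + "\n"


sample_conf = "driftfile /var/lib/ntp/drift\nserver A.B.C.D\n"
updated = ensure_tinker_panic(sample_conf)
print(updated)
```

Running the helper on its own output leaves the text unchanged, so it is safe to apply repeatedly.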


More extreme measures would be to use a different local clocksource, or to lower maxpoll so that NTP polls the server more often (adjusting maxpoll is usually not required or recommended).

Cheers

Mark

IamTHEvilONE
Immortal

I would also recommend avoiding two sources of truth.

I have seen people use NTP in the guest and also use VMware Tools' ability to sync to the ESXi host.

This can become a battle for truth, as VMware Tools might correct the guest to host time. If the host time is too far off, the drift settings may not allow ntpd to correct it.

You'll also notice problems in logging as timestamps move around, or jobs not working correctly.

So make sure to use either NTP in the guest on its own, or no guest NTP, with VMware Tools syncing to the ESXi host and the host syncing to NTP time servers.

Personally, I prefer the latter of those two options as it reduces the amount of configuration going into a specific virtual machine.
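The either/or rule above can be written down as a small check. This is a sketch only: the function name is hypothetical, the status string is assumed to come from `vmware-toolbox-cmd timesync status` (which reports Enabled or Disabled), and how you find out whether ntpd is running in the guest is left to you.

```python
def has_two_time_sources(tools_timesync_status: str, guest_ntpd_running: bool) -> bool:
    """True when VMware Tools time sync and guest ntpd are both active.

    tools_timesync_status is assumed to be the text printed by
    `vmware-toolbox-cmd timesync status` ("Enabled" or "Disabled").
    """
    tools_enabled = tools_timesync_status.strip().lower() == "enabled"
    return tools_enabled and guest_ntpd_running


# Example: Tools sync is on and ntpd is also running in the guest.
print(has_two_time_sources("Enabled\n", guest_ntpd_running=True))
```

A cell for which this returns True has the two competing sources described above, and one of them should be turned off.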

barnette08
Expert

After your comments I took a look at the ESX hosts and noticed that the hosts in the mgmt cluster weren't completely in sync and were pointing to the domain controller, which in turn points to the NTP server. The vCD cells point directly to the NTP server, while the ESX hosts pointed to the domain controller. I think this could have been causing some of the issues. I will give it a few months and see what happens. Thanks for all the help, guys.

IamTHEvilONE
Immortal

If the domain controller is a VM on an ESXi host and VMware Tools syncs its time, you are creating a circular dependency. That's another really odd case to be aware of.

IamTHEvilONE
Immortal

But yes, it sounds like the cells sync to the host and the ESXi hosts have a time issue. The VMs might have vMotioned between hosts, picked up a new time from the host, and then been off when you looked at the guest itself.
