Re: NTP broken after ESXi 7u3 upgrade

actyler1001 · 10-24-2021

NTP time sync appears to have broken after the 7U3 upgrade. Has anyone else run into this and have suggestions on how I might fix it? I've deleted and re-created the service, checked FW policy, tried different servers... Nothing helps.

Original Build: VMware ESXi, 7.0.2, 17867351

New Build: VMware ESXi, 7.0.3, 18644231

mikejroberts · 06-29-2022

Yes, it is still broken in update 3e.

GHMitchell · 06-29-2022

I would add to your list, @LabMasterBeta, if you don't mind.

1. NTP was broken along with a long list of other "low-level" functionality upon initial release of 7.0. We all saw the following releases, a-c, come and go with mention of this issue, being pulled off the download site and fully removed from circulation, but with no fix for NTP. This indicates that the issue is visible but not a priority.

2. I have personally received 2 responses from VMware employees that instructed me to create a ticket with support to elevate this issue specific to me. I cannot do that without purchasing a support contract. We own minimal licensing but not support, so I have to wait for one of you to lodge a complaint.

Suggesting a work-around is often welcome in situations like this, but at this point most of us have had to find a way to "silence" this error or take a chance that the wrong person notices, or worse, that our servers take random "naps". I appreciate that readers of this thread will try to help, often without reading the entire thread, and therefore will offer unrelated, already mentioned, or less than ideal stop-gap measures as band-aids. Unfortunately, doing so will almost always mean that if there is a fix, reversing the band-aids will be one more unpredictable variable in getting our systems back to a healthy state. Right now, it appears that a large portion of the 7.x user base is running with a piece of tape over the timekeeping warning light. If/when the devs at VMware get this "fixed", we will each have to find a way to reverse any stop-gap techniques we used to get this far. I think at this point, anyone who is aware of this issue has given up hope for a true fix and is anticipating a clean re-install once a true fix is released. That is pretty scary for those with large environments.

I don't want to drag the engineers at VMware. They produce a product that we all depend on, that usually works as intended, and has become the industry standard for virtualization all over the world. To keep that position, however, is very difficult. More so if they continue to introduce "features" without fixing existing issues.

TL;DR: Fix the timekeeping, please.

actyler1001 · 06-29-2022

So I am the guy that started this thread back in October 2021. Unreal that VMware has not addressed it. The vSphere 7 product is definitely crap. VMware needs to extend support for vSphere 6.7 at least another year.

PhillipA · 07-04-2022

Hi everyone, I have received a response to my support case.
(don’t shoot the messenger) 🙂

This alert is a known issue on 7.0.3 builds (19898904 & 19482537) that has no impact on the production environment, as it's a false alert.

Resolution

The resolution for the issue is said to be provided in ESXi 7.0.3 P06.

We expect the fix to be released this quarter or the first month in Q3.

But as you know, because of the testing phases here, it might take time to ensure that the issue does not happen again, so the release date will take time to be confirmed.

Kinnison · 07-04-2022

Hi,

IMHO and AFAIK,

From the introduction of version 7.0U3 until version 7.0U3c, disabling the "time service event monitoring" was a "workaround" made necessary to prevent the hostd process from crashing when an event related to the loss of time synchronization occurred. We all know the consequences of that process failing.
This article describes the specific problem: "The hostd service in ESXi 7.0U3 crashes due to memory corruption (86283)", https://kb.vmware.com/s/article/86283

Nowadays, disabling that option means that no time synchronization events are logged and that the vCenter object will show the error that is the subject of @actyler1001's original post, but otherwise it doesn't fix anything. Using one of the retired ESXi / vCenter versions is a problem equivalent to "flipping a coin".

As a consequence of a prolonged loss of internet access due to ISP outages, and of some work on my electrical system that lasted several hours, I noticed that when a time server referenced by its FQDN cannot be resolved and is also unreachable, the NTP service tends to produce "offset" and "jitter" values so high that it rejects all the configured time servers, even those referenced by their IP addresses, despite those being perfectly reachable.

For example, in my case the internet was not available from 12:50 am until (approximately) 7:00 pm. Meanwhile, all my ESXi hosts logged these events (give or take a few seconds):

system clock no longer synchronized to upstream time servers, Warning, 06/23/2022, 3:15:47 PM
system clock synchronized to upstream time servers, Warning, 06/23/2022, 4:43:29 PM
system clock no longer synchronized to upstream time servers, Warning, 06/23/2022, 5:43:30 PM
system clock synchronized to upstream time servers, Warning, 06/23/2022, 9:06:19 PM

The NTP service really did lose time sync as stated in the "event", but to be sure I have to check "by hand, host by host". In this sense I agree with the views expressed by others that the NTP service is somehow broken. That, in my IT context, a temporary loss of time synchronization does not cause insurmountable problems is another matter, but it is annoying all the same.
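
For anyone who would rather not do that arithmetic by hand, here is a minimal Python sketch that pairs the events quoted above and computes how long the clock was reported as unsynchronized. It assumes the events have been copied out as text in exactly the format shown; the function name `unsynced_intervals` is hypothetical, and the script does not query ESXi or vCenter in any way.

```python
from datetime import datetime

# Event lines exactly as quoted in this post (message, severity, date, time).
EVENTS = """\
system clock no longer synchronized to upstream time servers, Warning, 06/23/2022, 3:15:47 PM
system clock synchronized to upstream time servers, Warning, 06/23/2022, 4:43:29 PM
system clock no longer synchronized to upstream time servers, Warning, 06/23/2022, 5:43:30 PM
system clock synchronized to upstream time servers, Warning, 06/23/2022, 9:06:19 PM
"""

LOST = "system clock no longer synchronized"


def unsynced_intervals(text):
    """Return (start, end, duration) for each lost -> regained pair."""
    intervals = []
    lost_at = None
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split from the right so only the last three commas separate fields.
        message, _severity, date, time = [p.strip() for p in line.rsplit(",", 3)]
        stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p")
        if message.startswith(LOST):
            lost_at = stamp
        elif lost_at is not None:
            intervals.append((lost_at, stamp, stamp - lost_at))
            lost_at = None
    return intervals


if __name__ == "__main__":
    for start, end, duration in unsynced_intervals(EVENTS):
        print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}  unsynchronized for {duration}")
```

For the four events above this gives two gaps, of roughly 1 h 28 min and 3 h 23 min.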

Regards,
Ferdinando

jlanders · 07-05-2022

Ferdinando,

How many upstream time sources are in your ESXi NTP configuration? Are you using your internet provider's time sources or NTP pool at pool.ntp.org?

Kinnison · 07-06-2022

@jlanders, good morning,

Well,

In the beginning, for time synchronization I was using the NTP pool servers provided by "pool.ntp.org" for my country, which are three in number and obviously referenced by their FQDN. As I said earlier, if for some reason they were unavailable, the hostd process would crash in less than a minute, and so I did exactly as reported in the article I mentioned (I actually got there by myself).

That said, currently all of my core systems and services, including ESXi hosts, the vCenter object, domain controller, DNS server, etc., use no fewer than three time sources. The first two sources are stratum 2 public NTP servers referenced by their FQDN; the third source is my network switch, which is referenced by its IP address. It can provide a reliable time reference even when not synchronized to any upstream time server, because the correct time persists both after a restart and after a prolonged shutdown (it supports time-based ACLs). By doing so I have greatly reduced the potential for problems.

The latter references three time sources (all stratum 1) over the internet, referenced exclusively by their IP addresses, and also uses its internal clock (hardware with battery) as a time reference.

But I cannot rely on the vCenter object to be aware of it. As I reported in my previous post, some events were recorded, but from the "point of view" of the vCenter object UI, nothing has happened: Network Time Protocol, Running for less than a month and last time sync 06/23/2022, 8:59:31 PM (local time). And by the way, there is a "refresh" button that at the moment refreshes nothing.

My point is that, as I think I have already said, I can tolerate losing time synchronization for quite a long time without any particular problems, but I would like to be reasonably sure, without necessarily having to go "host by host" on the command line, of when, sooner or later, things (re)start working as they should.

Regards,
Ferdinando

Edit: Removed things already said and unnecessary repetition.

jlanders · 07-12-2022

Ferdinando,

Thanks for the information. Can you tell us exactly which version of ESXi and vCenter you're running?

Please see KB article 1022196 for more information on determining ESXi and vCenter build numbers.

Kinnison · 07-12-2022

@jlanders, good morning,

Currently in my small IT infrastructure for both ESXi hosts and vCenter object I'm running at the 7.0U3e build level.

Regards,
Ferdinando

KuotaiDavidSu · 07-28-2022

I upgraded from 7.0.3e to 7.0.3f. The 7.0.3f release of 07/12/2022 seems to fix the problem.

TPGOPI007 · 07-28-2022

That's not true. I upgraded to VMware ESXi, 7.0.3, 20036589, and the issue still persists.

joseronquillo · 07-28-2022

I installed VMware ESXi, 7.0.3, 20036589 over VMware ESXi, 7.0.2, 17630552 with no issues on a Dell R740 and an R730.

I checked the NTP service first: it was not in sync. I ran a refresh and a test, and then it worked.

TPGOPI007 · 07-28-2022

Try a reboot of the host and see if the warning comes back 😀

jlanders · 07-28-2022

Ferdinando,

ESXi 7.0U3e has a fix for the DNS name resolution issue. There was a problem when an NTP server's hostname couldn't be resolved. This could happen, for example, when the NTP pool no longer resolves hostnames to IP addresses because the upstream servers are unavailable or not providing accurate time. The ntppool.org web page has hints for handling this for countries with a small number of time servers in the NTP pool.

However, if you're still seeing this problem on ESXi 7.0U3e, please let us know and we can investigate further.

On VC 7.0U3e, when you click on refresh, you should see the "Host date and time" change. This may take a second or two. Is this not working for you? Or is the problem that the "Time services is currently not synchronized." is not cleared?

If the error is the second problem, can you tell us or post an image of what the "Test Services" text area shows? If it shows "Configuration is working normally" at the top, does it continue to work when you hit the "RE-RUN" button a number of times?

Kinnison · 07-29-2022

@jlanders, good morning,

First of all, thank you for your feedback,

Let's start by saying that I approached the problem "in my own way", referencing each time source by its IP address and not its FQDN, and then I added some options to the configuration to prefer my local time source and to speed up the initial synchronization ("prefer" and "iBurst"). Supported or not, the fact is that I have remedied this problem (or at least mitigated it to the point of not having to deal with it seriously anymore).

I probably should have thought about it first.

As far as I'm concerned, I don't use the "pools" of the ntp.org project (at least, I avoid them when I can). They work, but more than once I have detected time discrepancies that were a little too high.

Let's talk for a moment about the vCenter object. With the introduction of version 7.0u3e, the "refresh" button actually updates the "Host date and time" field; it's just a matter of waiting a few seconds. With the introduction of version 7.0u3f, the warning "Time service is currently not synchronized" clears itself when the time actually gets synchronized. Obviously, after my changes to the configuration of the NTP service, testing the service shows that the time sources "cannot be resolved"; I simply ignore this.

Speaking instead of ESXi 7.0U3e, I think it is positive that the interaction between the NTP service and the resolution of FQDNs has been corrected. Of course, I cannot verify it, for the reasons I explained above.

My point was to illustrate what was happening in a small IT environment like mine (perhaps similar to that of many others). Since it was obvious that something was not working, I have long since worked around it, right or wrong, supported or not.

Regards,
Ferdinando

A note: To date my small IT infrastructure is at level 7.0U3e as far as the ESXi hosts are concerned and 7.0U3g for the vCenter object.

KuotaiDavidSu · 07-29-2022

With 7.0.3f, I followed your suggestion and rebooted the host. The issue did show up, but after a couple of minutes I refreshed the time service and the warning message went away.

LabMasterBeta · 07-29-2022

I also use IPs (not FQDNs), with ESXi hosts providing time to VCSA.

My primary for NTP is always: time.nist.gov

Only if Secondary is required, then I use: pool.ntp.org

I've still had the ongoing issues with multiple clean installs described in detail in my prior posts, which are still not mentioned as fixed in the Release Notes.

I think it's unwise to disregard transitory errors; they usually mean something is failing and then potentially succeeding or, even worse, being suppressed from the GUI on reattempts.

jlanders · 07-29-2022

Ferdinando,

Thanks for the update. Please let us know if you have other issues and we'll help you get them resolved. We appreciate your honest and helpful feedback and will use it to make your vSphere experience better.

The current VC UI is confusing because there are two separate concepts on the same Time Configuration page: "setting" the system clock and "synchronizing" the clock. Having the clock "set" is critically important for elements of the virtual infrastructure including communication security and event monitoring. Having the clock "synchronized" is critical for keeping the clock "set" over the operational period of the virtual infrastructure. For many customers, the operational period is measured in years, so it's critical to have periodic updates from trusted time sources.

When the time service starts, the system clock gets "set" quickly using one of the provided time sources. Time "synchronization" takes a few minutes longer because clock adjustments get selected so that they can be slowly applied to keep the system clock from jumping abruptly forward or backward. Most customer inquiries about time involve questions about "synchronization" so the UI was designed to give a picture of exactly what is happening on the ESXi host without logging in via the command line shell.
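
To put rough numbers on "set" versus "synchronized", here is a small Python sketch. It assumes the defaults of the reference ntpd implementation (offsets above 128 ms are stepped, smaller ones are slewed at no more than 500 ppm); nothing in this thread confirms that the ESXi time service uses those exact values, and real ntpd disciplines the clock with a feedback loop rather than a fixed-rate slew, so treat the result as a lower bound and not a prediction. The function and constant names are made up for illustration.

```python
STEP_THRESHOLD_S = 0.128   # assumed: reference ntpd default step threshold
MAX_SLEW_PPM = 500.0       # assumed: reference ntpd maximum slew rate


def correction_plan(offset_s):
    """Classify an offset as a step or a slew and estimate the minimum slew time."""
    magnitude = abs(offset_s)
    if magnitude > STEP_THRESHOLD_S:
        # Large error: the clock is set ("stepped") in one jump.
        return ("step", 0.0)
    # Small error: the clock is nudged gradually; 500 ppm = 0.5 ms per second.
    seconds_needed = magnitude / (MAX_SLEW_PPM * 1e-6)
    return ("slew", seconds_needed)


if __name__ == "__main__":
    for offset in (5.0, 0.100, 0.010):
        action, seconds = correction_plan(offset)
        print(f"offset {offset:+.3f} s -> {action}, at least {seconds:.0f} s to absorb")
```

Under those assumptions, a 100 ms error needs at least 200 seconds of slewing, which fits with synchronization taking a few minutes longer than the initial set.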

Our new cluster wide configuration infrastructure will make all of the NTP options such as "iburst" available without needing to edit local files on ESXi hosts. The options will carry forward across reboots and software updates. Unfortunately, there's no way to enter them into the Time Configuration page without making the UI more complex.

Kinnison · 07-29-2022

Hi,

When "testing the service", the generated report says that a time server referenced by its IP address cannot be resolved. This is due to the NTP options I deliberately chose to add, so I take the liberty of ignoring it. I wouldn't allow myself to do this if I hadn't implemented other equally effective methods of monitoring the "health" of the NTP service. In any case, I do not suppress the errors.

@jlanders

Actually, on the latest ESXi build you can add options like "iBurst" or "prefer" on the NTP configuration page, next to the time server, without the need to manually edit configuration files or use the command line. As I "prefer" my local time source, from the point of view of the vCenter "Time Synchronization Services test", NTP "is in sync" at the moment in which that time source is marked as "selected", that is, within seconds.

Having said that, I really thank you for your offer of support, but as I wrote, I have already found an adequate remedy for the problem.

Regards,
Ferdinando

jlanders · 07-29-2022

Yes, we discovered the same error about resolving the NTP host name when "testing the service". It's been fixed internally, but I don't know when the fix will appear in a public release. We also fixed an issue internally that occurred when more than one option is specified on the NTP "server" line (e.g., server x.x.x.x minpoll 4 maxpoll 4 iburst) in ntp.conf. Some of the options were getting dropped when "ntp.conf" was reloaded into the configuration engine. That was also fixed, but I don't have visibility yet on when it will be available to you.
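
The option-dropping issue described above is easy to picture as a round trip. The following Python sketch is not VMware's configuration engine; it is a hypothetical parser (`parse_server_line` and `render_server_line` are invented names) that handles only the options mentioned in this thread, and it uses 192.0.2.10, a documentation address, in place of x.x.x.x. It shows the property a reload should preserve: every option on the "server" line survives parsing and re-rendering.

```python
# Options mentioned in this thread; ntpd accepts more, which this sketch ignores.
FLAG_OPTIONS = {"iburst", "prefer"}      # options with no argument
VALUE_OPTIONS = {"minpoll", "maxpoll"}   # options followed by one value


def parse_server_line(line):
    """Split 'server <address> [options...]' into (address, ordered option list)."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "server":
        raise ValueError(f"not a server line: {line!r}")
    address, rest = tokens[1], tokens[2:]
    options = []
    i = 0
    while i < len(rest):
        name = rest[i]
        if name in FLAG_OPTIONS:
            options.append((name, None))
            i += 1
        elif name in VALUE_OPTIONS:
            if i + 1 >= len(rest):
                raise ValueError(f"option {name!r} needs a value")
            options.append((name, rest[i + 1]))
            i += 2
        else:
            raise ValueError(f"unsupported option in this sketch: {name!r}")
    return address, options


def render_server_line(address, options):
    """Rebuild the line, keeping every option in its original order."""
    parts = ["server", address]
    for name, value in options:
        parts.append(name)
        if value is not None:
            parts.append(value)
    return " ".join(parts)


if __name__ == "__main__":
    original = "server 192.0.2.10 minpoll 4 maxpoll 4 iburst"
    address, options = parse_server_line(original)
    print(options)
    print(render_server_line(address, options) == original)
```

Keeping the options as an ordered list of pairs, instead of a single "last option wins" field, is one simple way to make sure nothing is lost when a line carries several options.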

Glad to hear things are working now. As always, let us know if you run into problems. We're happy to help you.
