actyler1001
Enthusiast

NTP broken after ESXi 7u3 upgrade

NTP time sync appears to have broken after the 7U3 upgrade.  Has anyone else run into this, and do you have suggestions on how I might fix it?  I've deleted and re-created the service, checked the firewall policy, and tried different servers.  Nothing helps.

Original Build: VMware ESXi, 7.0.2, 17867351

actyler1001_1-1635089620144.png

 

New Build: VMware ESXi, 7.0.3, 18644231

actyler1001_0-1635089532365.png

 

101 Replies
GHMitchell
Contributor

I just finished converting 2 ancient whiteboxes to VMs and moved our whole domain onto VMware. I "assumed" the best course of action would be to install the latest release of ESXi. It took me 2 weeks of banging my head against the wall trying to install vSphere, only to discover that NTP was the problem. Manually configure the time and voilà, it works. How can I trust our entire infrastructure to a dev team that can't set the time on the d@#$ VCR?

 

I would have been better off installing 6 and waiting for 7 to be stable. That would be my advice to anyone performing a new deployment today. Install the last stable release of 6 or 7.0 and wait for the rest of us to muddle through until 7.4.

 

actyler555
Enthusiast

"How can I trust our entire infrastructure"

Exactly!  VMware is proving over and over again they are not worthy of this trust.  I really hope that this turns around, but I've been disgusted at their development performance with the vSphere 7 product.

Jensational
Contributor

Still a problem in vSphere ESXi 7.0 Update 3c  (build-19193900)

[root@pmi-esx01:~] cat /var/log/vmkernel.log  | grep NTPClock
2022-01-31T04:23:01.998Z cpu25:2099355)WARNING: NTPClock: 1712: system clock synchronized to upstream time servers
2022-01-31T06:33:44.006Z cpu0:2098149)WARNING: NTPClock: 644: system clock apparently no longer synchronized to upstream time servers
2022-01-31T08:12:15.998Z cpu31:2099355)WARNING: NTPClock: 1712: system clock synchronized to upstream time servers
0:00:00:05.721 cpu0:2097152)Initializing InitVMKernel: (131/186) NTPClock_Init ...
2022-01-31T13:28:14.000Z cpu0:2097152)SysInitTable: 112: Finished sysInit step: NTPClock_Init in 1411 us.
2022-01-31T13:28:42.091Z cpu11:2098486)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-01-31T13:41:50.997Z cpu21:2098486)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-01-31T14:26:39.998Z cpu20:2098486)WARNING: NTPClock: 1457: system clock stepped to 1643639201.001221000, no longer synchronized to upstream time servers
2022-01-31T14:26:41.001Z cpu20:2098486)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-01-31T14:59:56.776Z cpu6:2109353)WARNING: NTPClock: 1457: system clock stepped to 1643641196.000498000, no longer synchronized to upstream time servers
2022-01-31T15:12:00.996Z cpu20:2109371)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-01-31T19:14:14.003Z cpu0:2097416)WARNING: NTPClock: 680: system clock apparently no longer synchronized to upstream time servers
2022-01-31T19:42:36.000Z cpu22:2109371)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-01-31T20:29:23.003Z cpu35:2109371)WARNING: NTPClock: 1457: system clock stepped to 1643660962.001395000, no longer synchronized to upstream time servers
2022-01-31T20:29:22.002Z cpu35:2109371)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-01-31T23:06:47.006Z cpu0:2098007)WARNING: NTPClock: 680: system clock apparently no longer synchronized to upstream time servers
2022-02-01T02:05:56.989Z cpu24:2109371)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-02-01T03:05:58.009Z cpu0:2104433)WARNING: NTPClock: 680: system clock apparently no longer synchronized to upstream time servers
2022-02-01T04:43:46.001Z cpu20:2109371)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-02-01T05:43:47.006Z cpu0:2097319)WARNING: NTPClock: 680: system clock apparently no longer synchronized to upstream time servers
2022-02-01T08:14:52.999Z cpu20:2109371)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
2022-02-01T09:15:15.872Z cpu37:2126587)WARNING: NTPClock: 1457: system clock stepped to 1643706915.000534000, no longer synchronized to upstream time servers
2022-02-01T09:29:46.001Z cpu2:2126606)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers
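
For anyone who wants to summarize a log like the one above instead of eyeballing it, here is a minimal Python sketch. It assumes the NTPClock message wording matches the lines pasted in this post; the function name `classify` and the event labels are made up for illustration and are not part of any VMware tool. For "stepped" lines it also compares the epoch value the clock was stepped to against the timestamp on the log line, which gives a rough size and direction for each step.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches lines such as:
# 2022-01-31T14:26:39.998Z cpu20:2098486)WARNING: NTPClock: 1457: system clock stepped to ...
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d{3})Z .*NTPClock: \d+: (?P<msg>.*)$"
)
STEP = re.compile(r"system clock stepped to (?P<epoch>\d+\.\d+)")


def classify(line):
    """Return (log_time, kind, step_seconds), or None if the line is not an NTPClock event."""
    m = LINE.match(line)
    if m is None:
        return None
    log_time = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)
    msg = m["msg"]
    step = STEP.search(msg)
    if step:
        # Positive means the clock was moved forward relative to the log timestamp.
        return log_time, "stepped", float(step["epoch"]) - log_time.timestamp()
    if "no longer synchronized" in msg:
        return log_time, "lost", None
    if msg.startswith("system clock synchronized"):
        return log_time, "synced", None
    return log_time, "other", None


# Three lines copied from the log above, used only as sample input.
SAMPLE = [
    "2022-01-31T13:41:50.997Z cpu21:2098486)WARNING: NTPClock: 1764: system clock synchronized to upstream time servers",
    "2022-01-31T14:26:39.998Z cpu20:2098486)WARNING: NTPClock: 1457: system clock stepped to 1643639201.001221000, no longer synchronized to upstream time servers",
    "2022-01-31T19:14:14.003Z cpu0:2097416)WARNING: NTPClock: 680: system clock apparently no longer synchronized to upstream time servers",
]

events = [event for event in map(classify, SAMPLE) if event]
print(Counter(kind for _, kind, _ in events))
for log_time, kind, step_seconds in events:
    if kind == "stepped":
        print(log_time.isoformat(), f"step of about {step_seconds:+.3f} s")
```

Counting the "synced" and "lost" events per day gives a quick picture of how often the host is flapping, and the sign of each step shows whether the clock was pushed forward or pulled back.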

[root@pmi-esx01:~] ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.XXX.0.1      .LOCL.           1 u   15   64    1    0.959  -85.494   0.000
 10.XXX.0.11     .LOCL.           1 u   14   64    1    1.578  -85.761   0.000
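
A note on reading this table: `reach` is an 8-bit history of recent polls printed in octal, `offset` is in milliseconds, and a `*` in the first column marks the peer ntpd has actually selected. A small Python sketch for pulling those fields out is below. It assumes the ten-column `ntpq -pn` layout shown in this post; the `Peer` class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    tally: str        # first column; "*" marks the peer ntpd is currently synced to
    remote: str
    refid: str
    stratum: int
    reach: int        # 8-bit shift register of recent polls (ntpq prints it in octal)
    offset_ms: float


def parse_ntpq(text):
    """Parse `ntpq -pn` output, assuming the tally column is present on every peer line."""
    peers = []
    for line in text.splitlines():
        body = line.strip()
        if not body or body.startswith(("remote", "=")):
            continue  # skip blank lines, the header, and the ==== separator
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:
            continue  # not a peer line in the layout this sketch expects
        peers.append(
            Peer(tally, fields[0], fields[1], int(fields[2]), int(fields[6], 8), float(fields[8]))
        )
    return peers


def answered_polls(peer):
    """Number of the last 8 polls that got a reply."""
    return bin(peer.reach).count("1")


# The output pasted above, kept as separate strings so the leading tally column survives.
SAMPLE = "\n".join([
    "     remote           refid      st t when poll reach   delay   offset  jitter",
    "==============================================================================",
    " 10.XXX.0.1      .LOCL.           1 u   15   64    1    0.959  -85.494   0.000",
    " 10.XXX.0.11     .LOCL.           1 u   14   64    1    1.578  -85.761   0.000",
])

peers = parse_ntpq(SAMPLE)
for peer in peers:
    print(peer.remote, "answered", answered_polls(peer), "of the last 8 polls; offset", peer.offset_ms, "ms")
print("peer selected for sync:", any(peer.tally == "*" for peer in peers))
```

In the output above neither line carries a `*`, and a `reach` of 1 means only the most recent poll has been answered, which is what you would expect to see shortly after the service has (re)started.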
 

 

GHMitchell
Contributor

I regret contradicting an earnest post, but this is not a benign problem. My 2012 Server Essentials DC (not installed by me; it was installed before I got here, just sayin') reboots at random intervals, anywhere from 6 hours to 6 days, due to the NTP issue. I had to define the time locally on the server in PowerShell. NTP is basically worthless.

 

On top of that, we got a new release of 7.0.3 that did not fix the issue.

 

A false positive can be safely ignored. This is not that.

CoffeeBlackest
Contributor

Any updates from other folks on this?  We receive alerts from time to time about the NTP service failing.  When we check, the service shows an uptime of only minutes or hours, depending on when we look.  These hosts are running 7u3c.  We have no reported issues from older 6.5 and 6.7 hosts.

gapostolou
Contributor

This worked for me, in that the time has now synchronised 😀 but the results of the test still say failed. At least the time is correct now and will allow for me to progress config!

devakumar
VMware Employee

Hello,

 

Please open an SR with VMware support with all ESXi host logs of 7.0 U3c where you are seeing NTP related alerts along with screenshots etc., to get it checked.

LabMasterBeta
Enthusiast

Buried in the ESXi 7.0 Update-3c Release Notes, FINALLY an acknowledgement of the ongoing NTP issues:

"NTP optional configurations do not persist on ESXi host reboot.  When you set up optional configurations for NTP by using ESXCLI commands, the settings might not persist after the ESXi host reboots.  This issue is resolved in this release. The fix makes sure that optional configurations are restored into the local cache from ConfigStore during ESXi host bootup."

**HOWEVER**  After rebooting the Host it's working, BUT if you DARE to go into the ESXi host on the Web interface directly (not vCenter), and touch or even just verify the NTP settings, and click Save, then..... Kaboom! 

Issue #1 of 2:

It STOPS the NTP service, you CANNOT click on the Actions to Restart NTP, and basically... you must Reboot the Host to get NTP back again. 

EVEN STRANGER, if you change the settings for NTP (again, on the ESXi web interface directly), it will REVERT to local host clock and LOOK like it FAILED to Save NTP settings, BUT... then if you reboot, the changes ARE indeed read upon Host Reboot and the NTP service WILL start using the settings that LOOKED like they failed to Save in the GUI (meaning the GUI will show it correctly), but cannot be read until Reboot is done for ESXi host.

AND AGAIN, then, if you DARE touch the NTP settings, it will AGAIN revert to local host clock, STOP the NTP service, and stay that way in the GUI until you reboot the ESXi host. 

Issue #2 of 2 (exacerbates issue #1):

On the ESXi web interface for Host, navigate to Manage > System > Time & date > Actions button.  It does NOTHING!  Tried in Chrome and Firefox (the "latest" versions as of 2/20/2022).

This compounds the issue #1 above, because you cannot Stop or Start the NTP service from this Web GUI at all without being able to click that Actions button that USED to work great.

To reiterate in summary, as I am in total disbelief:

ESXi 7.0 U3c, generic image from VMware's direct ISO download only. Fully supported hardware, all of it auto-recognized in the Host. There is no vCenter, so that's eliminated as a reason. All of this is to prove it is an Out-Of-The-Box ONGOING BUG with NTP on ESXi 7.0 U3c...  No driver fiddling or other VIB updates etc. Just TRY to set up NTP on a local ESXi host only (no vCenter); the only change was the HostName in the DCUI after the first post-install reboot, and it STILL has this same NTP problem!

In short, I can't believe it....

I've never seen such a critical service as NTP be so broken in vSphere for so long, even after VMware THOUGHT they "fixed" it in the release notes for 7.0 Update-3c... But Nope!

I have a wild-guess theory: 

Starting with ESXi 6.7, and with 7.0u3 adding more new features on top of 7.0u2, VMware compresses and encrypts startup files based on TPM 2.0 (and, if no TPM exists, generates something similar in a locally encrypted file).  My wild guess is that some low-level scripts core to the NTP service, and potentially other services, have not yet been 100% patched or rewritten to support this heightened security on local ESXi hosts using crypto, plus the New Features added specifically in ESXi 7.0 Update-3. They remained unfixed in Update 3a and Update 3b, and the attempted fix in 7.0 Update 3c is still partially broken....

And remember: Clean ESXi 7.0u3c Install, No VIB updates, No vCenter complexities; Still NTP issues!

Thoughts on this?

LabMasterBeta
Enthusiast

@devakumar  What do you mean by 7.0 U3c NTP-related alerts? NTP is simply still broken, as shown by trying to use the ESXi 7.0 U3c web GUI. Please read my post detailing the reproducible ESXi 7.0 U3c NTP behavior that we are complaining is still broken. Thank you!

Kinnison
Enthusiast

Hi,

@LabMasterBeta 

I just deployed an ESXi host, version 7.0U3c, and I was able to configure NTP via the web interface. I changed it multiple ways and it persisted across reboots every time, but only when ntpd was set to "start" in the "service tab" (it takes some time to "react"); by default it is "stop".

Regards.

   

 

LabMasterBeta
Enthusiast

@Kinnison

1. At the Manage > System > Time & date > Actions button, does that button work for you?

It does nothing when I click it trying Google Chrome and Mozilla Firefox.

2. If you change from System Clock to NTP, then from NTP to System Clock, and finally one last time System Clock back to NTP, does it Save and Work for you?

It does not Save my changes to use NTP, it reverts every time back to System Clock for Host.

Note:  I'm using a SuperMicro server on the VMware hardware compatibility list supported and listed by vSphere 7, no OEM VIB's required.

Thoughts?

Kinnison
Enthusiast

@LabMasterBeta 

Well, the "Actions" button in fact apparently does nothing with ESXi 7.0U3c and below. I don't remember exactly which release the button "started" to be "useless" in, because I don't like to bother with something that does not work as intended / expected. 

 

The annoying part is this: when NTP has been configured and set to start / stop with the host, but the state of the NTP client in the "service tab" remains "stopped" (the default), that status persists across subsequent reboots (the NTP service does not start with the host) until it is set to start. After that, I can change the NTP settings at runtime without issues.

 

In my "lab" the main systems are a pair of Dell R730s set up to be a "replica" of a small, but real, production environment.

LabMasterBeta
Enthusiast

@Kinnison 

Thanks for the detailed clarification on the annoying NTP behavior you observed.

Just like with other OS's, there are always multiple ways in different ESXi interfaces to use certain feature(s).

So based on your reply, it sounds like you've validated the complaints being discussed in this forum: you simply use vSphere in a different manner/interface than I detailed, but you have the same NTP issues/bug as those being discussed (in general).

In other words, hopefully these posts document the different ways to observe and prove these bug(s) with NTP on ESXi 7.0u3c, so they can finally be fixed by VMware once and for all. Sooner vs later!

 

Kinnison
Enthusiast

@LabMasterBeta 

Well,

 

IMHO the NTP service itself is not really broken, but the way the management tools interact with it is debatable.

 

I was bitten by this (and other related issues in the middle):

NTP service auto start is not working in ESXi 7.0 (80189) (vmware.com)

The hostd service in ESXi 7.0U3 crashes due to memory corruption (86283) (vmware.com)

As for the behaviour of the "Actions" button, an article I found on the web (a blog), dated April 2020, reported what you noted. And as we well know, all builds of ESXi 7.0U3 before the current one were retired.

Of course, if an NTP time source is not as reliable as it should be, that is another story.

Regards,

Ferdinando

 

 

CoffeeBlackest
Contributor

For what it's worth... we worked with both VMware and Nutanix on the issue I mentioned earlier.  Both have known issues with NTP, which may or may not be related.  The main issue we were able to identify on the VMware side was that most of the hosts are having issues with the ntp service restarting. We didn't find a solution or even really the cause, but they're aware of the issue.  On the Nutanix side, they have a conf file which has been made read-only after 7u3c (for most Nutanix users this won't make much sense, but this doesn't seem to be an issue in earlier ESXi 6.x versions), and even if I chmod it to rw, it changes back at 3am on most* of the CVMs.  Both companies say they're working on the issues and were aware of them before I contacted them.

actyler555
Enthusiast

Unreal to me that vSphere 7 isn't a stable product yet. vSphere 6.x goes EOL by the end of the year too.  I've been waiting for the bugs to get worked out of 7 before upgrade, doesn't look like that is ever going to happen.

GHMitchell
Contributor

Don't move until you have:

a) a healthy lab environment in which you have proved everything out, including simulating your network, iscsi, autostart sequence, etc

b) no other choice

 

In my case, I built a new vmware environment.

I used the trial license to build a simple 3 host vCenter Essentials managed domain including a new 2019 DC, then purchased and applied a permanent VMware Essentials License. I carefully created copies of 2 key production servers (one is my PDC, a 2012 Essentials server; long story, don't ask) using VMware Converter, and only put them in production after proper testing. I am not linking the converter here because it is dangerous, no longer supported, and I can't be held responsible for your results. It is also the only way I have ever successfully converted a live RAID 5 array into a working VM. It can be done other ways, by others more skilled than I.

That left 3 VMs in production, all on ESXi 7.0.3, on two identical Dell PowerEdge R340s: 3 relatively small servers on 2 relatively beefy platforms. It ran fine for weeks of testing, and almost a month live in production.

And then my PDC shut itself down without warning. I booted it back up and checked the logs: the time suddenly differed by 5 hours from the other 2 servers. It turned out the PDC was on a host that had lost time. I lost hours that day. I thought I had fixed it. Frankly, it was a lot of starting and stopping the NTP service by different means: inside the web client (which I hate and wish very much was replaced by a working, locally installed client like it should be, but whatever), over SSH, and bashing the service in the CLI. I couldn't tell if it was actually running or not, because vCenter presented an error indicating that the service was not running. I didn't take a screenshot at the time, but if you are reading this you probably don't need one. This started at 10am during production. All of my CAD designers (and I am one of them when things are working properly) got separated from their data and had to work offline without the vault.

There is more to the story, but this is already a wall of text. I will add that I did not pay for support, and as such am not allowed by the web portal to create a support ticket. Therefore it is not helpful for a vmware support technician, who means well, to suggest that I do that. Also, I have read thousands of words related to this issue, and have gone as far as to re-create my environment using the Dell specific image for my Servers. It is somehow related to MS implementation of TPM2, in a way no one is really talking about. I smell a zero-day here somewhere.

It is still there. NTP will drop, your host will report UTC, and down goes your domain. Just set the NTP server on your PDC to point to your router, or something else that runs firmware inside your domain, and move on to another problem. I'm sure it will be fixed by the time 8 is released.
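
Since it can be hard to tell from the UI whether a host is really in sync, one independent check is to ask an NTP server directly from a VM or workstation and compare the answer with the local clock. Below is a minimal SNTP sketch in Python using only the standard library. The function names are made up for illustration, the server address in the usage comment is a placeholder, and this is a rough sanity check under those assumptions, not a replacement for ntpd or for whatever VMware support would ask you to run.

```python
import socket
import struct
import time

NTP_UNIX_DELTA = 2208988800  # seconds between 1900-01-01 (NTP epoch) and 1970-01-01 (Unix epoch)


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (32-bit seconds, 32-bit fraction) to Unix time."""
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def clock_offset(t0, t1, t2, t3):
    """Standard NTP offset estimate.

    t0 = client send, t1 = server receive, t2 = server transmit, t3 = client receive.
    A positive result means the local clock is behind the server.
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def query_offset(server, timeout=2.0):
    """Send one SNTP request to `server` and return the estimated offset in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, VN=3, Mode=3 (client), rest zeroed
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t0 = time.time()
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
        t3 = time.time()
    if len(reply) < 48:
        raise ValueError("reply too short to be an NTP packet")
    # Bytes 32-39 hold the server receive timestamp, bytes 40-47 the transmit timestamp.
    rx_s, rx_f, tx_s, tx_f = struct.unpack("!4I", reply[32:48])
    return clock_offset(t0, ntp_to_unix(rx_s, rx_f), ntp_to_unix(tx_s, tx_f), t3)


# Usage (placeholder address; substitute your router or other internal time source):
# print(f"offset: {query_offset('192.0.2.1'):+.3f} s")
```

Run from a guest on the suspect host, an offset of hours rather than milliseconds would point at the kind of drift described above.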

LabMasterBeta
Enthusiast

I've done some more digging on the "how did this happen" question, and I have to wonder:

It seems vSphere 7.0's Update-3 added many new features, making it a LOT more like a Feature Upgrade than a Maintenance Update, and this entire thread is based on NTP breaking starting at U3 (along with other issues introduced).

To my way of thinking, this explains why 7.0 U3, U3a, and U3b were all pulled upon discovery of a critical core flaw, and then made available again as 7.0 U3c...

vSphere 7.0's initial release at least had the basics like NTP working. It's those new U3 features that injected critical, unforeseen issues into fundamentals like NTP.

So, I'll just speculate that an Update in general does not get the same level of regression testing as a full all-new release like 7.0's launch, which would be heavily debugged, then beta tested, and scrutinized as a preview release prior to production recommendations. In addition, many customers were waiting for the first few Updates to arrive before trusting 7.0 in production, thinking U3 was only a Maintenance Update when in reality it was somewhat of a Feature Upgrade. This unfortunately has bitten everyone, including VMware.

Looking through the release notes of the latest prior vSphere 7.0 Update-2e (released Feb 15th 2022), it does not have any mention whatsoever about NTP.. because as mentioned at the very top of this thread - NTP was never broken until U3, U3a, U3b, and attempted fix but still quasi-broken in U3c's release!

Thus IMHO it seems to maintain credibility, VMware has a few choices prior to October 15th 2022:

1.) Fix all fundamental issues like NTP for vSphere 7.0 Update/Upgrade-3 as an expedited priority.

2.) Extend vSphere 6.7's EOL/EOS date AGAIN  (like they already did on June 3rd 2020 for 11 months, now supported up to Oct 15th 2022).

3.) Post an advisory that customers using vSphere for critical infrastructure may want to consider the latest build of 7.0 U2 until said U3 issues are all 100% confirmed resolved OR 7.1 is launched with serious assurances.

Just my $0.02 speculation after all this wasted time on 7.0u3 reading KB's and Release Notes....

GHMitchell
Contributor

@LabMasterBeta 

I agree with your fundamental points.

I would add, the user base of VMWare is constantly changing. A certain percentage of those users are power users that are lab testing every release to death. Another portion of those users are like me, with some level of mastery below expert, but aware enough of the fundamentals to make use of the free/nearly free platform. If this latter group does not do their research, they may assume that the latest release is the way to go, especially if the release ID is beyond x.0. I looked at 7.0.3 and thought, surely they have the major bugs worked out by now, and I was wrong.

I don't want to sound like I am beating up on the developer of a, by and large, superb product that is made available essentially for free. At the same time, if you find out that the backend time-keeper of a hardware emulator intended to support the bulk of business operations for millions of users around the world isn't working, maybe don't bury the lead. Tell people at the point of download, before implementation, that they need an alternative ntp for the foreseeable future. Now we are in a situation where 6 is at EOL, we can't trust 7, and are looking sideways at 8. If there were a clear alternative (don't say Hypervisor) I would take it.

Again, I am a grifter here. I don't manage 2000 hosts with HA and vSAN. I paid <$1k. But if this is happening to me, it must be affecting the big guys too.

actyler555
Enthusiast

Well said, both @LabMasterBeta and @GHMitchell ..