frozenak
Enthusiast
Enthusiast

TKG Supervisor Cluster Constant Disk IO

Jump to solution

Is it expected for the supervisor cluster VMs to constantly be writing to disk at roughly 8-16MB/s? Each one is writing at that rate.

Spun up workload management in my homelab, and since the supervisor cluster was deployed there's constant writing to disk, with VMs reporting 4,000 write ops/sec and >30MB/s being written to disk by just these 3 VMs.

root@420631715fd538d484265a9a44e451cb [ /usr ]# /usr/bin/python3 /usr/bin/dstat
You did not select any stats, using -cdngy by default.
Color support is disabled as python-curses is not installed.
/usr/bin/dstat:515: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
if isinstance(self.val[name], collections.Sequence) and not isinstance(self.val[name], six.string_types):
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
14 10 41 35 0| 210k 16M| 0 0 | 0 0 | 17k 47k
10 9 43 37 0| 0 16M| 27k 59k| 0 0 | 15k 50k
13 9 42 35 0| 0 14M| 39k 100k| 0 0 | 16k 45k
8 11 45 35 0| 0 15M| 13k 38k| 0 0 | 18k 48k
12 9 45 34 0| 0 15M| 35k 95k| 0 0 | 17k 49k
11 12 43 34 0| 0 15M| 29k 63k| 0 0 | 21k 43k

 

root@42068944986ef297cc64a8a2a234c0dd [ /usr/bin ]# /usr/bin/python3 /usr/bin/dstat
You did not select any stats, using -cdngy by default.
Color support is disabled as python-curses is not installed.
/usr/bin/dstat:515: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
if isinstance(self.val[name], collections.Sequence) and not isinstance(self.val[name], six.string_types):
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
10 9 44 37 0| 162k 16M| 0 0 | 0 0 | 15k 41k
12 10 39 38 0| 0 16M| 16k 7118B| 0 0 | 16k 38k
7 7 44 41 0| 0 16M| 147k 58k| 0 0 | 17k 40k
8 8 45 39 0| 0 15M| 22k 9355B| 0 0 | 19k 41k
9 9 45 38 0| 0 16M| 137k 38k| 0 0 | 19k 40k
9 10 42 39 0| 0 16M| 28k 11k| 0 0 | 17k 36k
7 8 44 41 0| 0 16M| 56k 26k| 0 0 | 17k 39k
8 9 43 39 0| 0 16M| 22k 10k| 0 0 | 17k 37k

 

root@4206a24f8bfc148c68b68e56e223fa7f [ ~ ]# /usr/bin/python3 /usr/bin/dstat
You did not select any stats, using -cdngy by default.
Color support is disabled as python-curses is not installed.
/usr/bin/dstat:515: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
if isinstance(self.val[name], collections.Sequence) and not isinstance(self.val[name], six.string_types):
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
20 16 38 26 0| 163k 9567k| 0 0 | 0 0 | 10k 27k
18 15 40 27 0| 0 10M| 50k 21k| 0 0 |7850 26k
18 17 38 27 0| 0 9292k| 23k 11k| 0 0 |8936 25k
15 15 41 29 0| 0 8816k| 52k 23k| 0 0 | 10k 26k
14 14 41 31 0| 0 9464k| 29k 15k| 0 0 | 11k 26k
23 15 34 28 0| 0 9700k| 50k 22k| 0 0 | 12k 27k
17 17 38 28 0| 0 9024k| 25k 10k| 0 0 | 11k 26k
18 12 40 30 0| 0 9696k| 52k 24k| 0 0 | 11k 26k
18 17 38 27 0| 0 8968k| 25k 13k| 0 0 | 11k 26k
21 18 33 27 0| 0 9312k| 42k 18k| 0 0 | 13k 26k
36 24 20 20 0| 0 10M| 22k 10k| 0 0 | 15k 33k
15 13 43 28 0| 0 8524k| 55k 24k| 0 0 | 10k 25k
17 19 37 27 0| 0 8888k| 27k 15k| 0 0 | 11k 27k
14 16 45 25 0| 0 9880k| 46k 20k| 0 0 | 11k 33k
13 17 45 26 0| 0 10M|6903B 4904B| 0 0 |9881 32k

Validated by reviewing statistics on my storage and VM statistics.

https://imgur.com/a/7xqKz1Q

Have not yet figured out how to determine what/where it's writing to, but... that's seems a bit much no? 16MB/sec is a lot of text if it's just logging, and it's all in writes.

I've also deployed these numerous times with the same results. When I used "large" deployment I think it was a touch lower.

0 Kudos
1 Solution

Accepted Solutions
depping
Leadership
Leadership

Issue has been confirmed in our engineering environment as well, it is being investigated and hopefully fixed in a future release. If you run this in production I would recommend filing an SR for it and submitting all details, that way it can be attached to the engineering issue.

View solution in original post

21 Replies
faebudo
Enthusiast
Enthusiast

Yes, I also experience about 500TBW per year (2.5PBW over 5 years, that will chew through your vSAN Cache Tier) per supervisorcontrolplaneVM in 7.0U1

About 90-95% of the traffic is caused by

/fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf --plugin /fluent-bit/bin/out_syslog.so
 
You can check it by running pidstat -d
 
Also the CPU load on the supervisorcontrolplanevm is sometimes very high but immediately goes down when loggin in via SSH:
 
Screenshot 2021-01-25 203947.png

 

Edit:

When checking the logs of the fluentbit Pods I can see that they each generate about 700 lines per second logs like this:

[2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
[2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
[2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
[2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
[2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
[2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY

0 Kudos
faebudo
Enthusiast
Enthusiast

I found the file that is written is /var/log/vmware/fluentbit/consolidated.log and the error is as follows:

kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935094, {"log":"2021-01-25T20:49:23.935062778Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.934908315Z\",\"caller\":\"storagepool/listener.go:98\",\"msg\":\"Not able to establish connection with VC. Restarting property collector.\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:98\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935898, {"log":"2021-01-25T20:49:23.93586061Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935757796Z\",\"caller\":\"vsphere/virtualcenter.go:232\",\"msg\":\"failed to obtain user session with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:232\\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:205\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.936509, {"log":"2021-01-25T20:49:23.935989826Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935803811Z\",\"caller\":\"vsphere/virtualcenter.go:207\",\"msg\":\"Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:207\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]

 

And the root cause is this:

Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host

 

Photon OS 3.0 is not able to resolve .local domains: DNS Related Troubleshooting With Unified Access Gateway (50120424) (vmware.com) and DNS Related Troubleshooting With Unified Access Gateway 3.7 and newer which is based on Photon 3 (78...

So I will have to change my homelab's domain.

Do you use .local too as your domain?

0 Kudos
faebudo
Enthusiast
Enthusiast

That was a red herring, the IOPS just are on another controllerVM now

But I found the file that is written is /var/log/vmware/fluentbit/consolidated.log and the error is as follows:

kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935094, {"log":"2021-01-25T20:49:23.935062778Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.934908315Z\",\"caller\":\"storagepool/listener.go:98\",\"msg\":\"Not able to establish connection with VC. Restarting property collector.\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:98\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935898, {"log":"2021-01-25T20:49:23.93586061Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935757796Z\",\"caller\":\"vsphere/virtualcenter.go:232\",\"msg\":\"failed to obtain user session with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:232\\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:205\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.936509, {"log":"2021-01-25T20:49:23.935989826Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935803811Z\",\"caller\":\"vsphere/virtualcenter.go:207\",\"msg\":\"Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:207\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]

And the root cause is this:

Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host

But now I am not sure why it cannot resolve this, all DNS I use can indeed resolve vc.local

I will look at this more in-depth now

0 Kudos
faebudo
Enthusiast
Enthusiast

That was a red herring, the IOPS just are on another controllerVM now

But I found the file that is written is /var/log/vmware/fluentbit/consolidated.log and the error is as follows:

kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935094, {"log":"2021-01-25T20:49:23.935062778Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.934908315Z\",\"caller\":\"storagepool/listener.go:98\",\"msg\":\"Not able to establish connection with VC. Restarting property collector.\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:98\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935898, {"log":"2021-01-25T20:49:23.93586061Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935757796Z\",\"caller\":\"vsphere/virtualcenter.go:232\",\"msg\":\"failed to obtain user session with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:232\\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:205\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.936509, {"log":"2021-01-25T20:49:23.935989826Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935803811Z\",\"caller\":\"vsphere/virtualcenter.go:207\",\"msg\":\"Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:207\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]

And the root cause is this:

Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host

But now I am not sure why it cannot resolve this, all DNS I use can indeed resolve vc.local

I will look at this more in-depth now

Edit:

Photon OS 3.0 is not able to resolve .local domains: DNS Related Troubleshooting With Unified Access Gateway (50120424) (vmware.com) and DNS Related Troubleshooting With Unified Access Gateway 3.7 and newer which is based on Photon 3 (78...

So I will have to change my homelab's domain.

Do you use .local too as your domain?

0 Kudos
faebudo
Enthusiast
Enthusiast

My torough answer always gets deleted but it boils down to:

Do not use a .local domain for your vCenter, Photon OS 3.0 fails resolving .local and the controller tries connecting to your vcenter thousands of times which creates many logs in /var/log/vmware/fluentbit/consolidated.log

0 Kudos
frozenak
Enthusiast
Enthusiast

I am using a .local

Took a look at that log, and can find no references to my vcenter by name or FQDN, no DNS errors.

The VM can also resolve the hostname just fine. I provided the DNS Search Domain during workload management deployment.

root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep vcenter.incendiary.local
root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep vcenter

root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep 192.168.1.100
root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# ping vcenter
PING vcenter.incendiary.local (192.168.1.100) 56(84) bytes of data.
64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=1 ttl=62 time=4.51 ms
64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=2 ttl=62 time=3.85 ms
64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=3 ttl=62 time=4.72 ms
64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=4 ttl=62 time=3.79 ms

Only errors I can find are (primarily) related to certificates.

1:57102: remote error: tls: bad certificate","kubernetes":{"pod_name":"vmware-system-license-operator-controller-manager-7d8c9679l8dcq","namespace_name":"vmware-system-license-operator","pod_id":"c9941aeb-e0f3-470e-b4ed-1088226e751d","labels":{"control-plane":"license-operator","pod-template-hash":"7d8c9679b8"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"42068944986ef297cc64a8a2a234c0dd","container_name":"manager","docker_id":"1dc94e93d423362e797945988248e7df7c8fcb9d5b79124ec57a95a79644a446","container_hash":"localhost:5000/vmware/license-operator@sha256:ada038288ec9b5b752bfc0977cbbc3de694bf4e901043dce9620a06ef8fe0a23","container_image":"localhost:5000/vmware/license-operator:031b086"}}]

Though there are failures calling webhook

kube.var.log.containers.kube-apiserver-42068944986ef297cc64a8a2a234c0dd_kube-system_kube-apiserver-22b56fc4db84352c49181d41d260294ffa395a38af5bbadcacd3891653ea7d1d.log: [1611612175.872795, {"log":"2021-01-25T22:02:55.872387647Z stderr F W0125 22:02:55.872299 1 dispatcher.go:128] Failed calling webhook, failing open default.validating.license.supervisor.vmware.com: failed calling webhook \"default.validating.license.supervisor.vmware.com\": Post https://vmware-system-license-operator-webhook-service.vmware-system-license-operator.svc:443/superv...: x509: certificate signed by unknown authority","kubernetes":{"pod_name":"kube-apiserver-42068944986ef297cc64a8a2a234c0dd","namespace_name":"kube-system","pod_id":"4d13c00e-cef1-40c7-b5eb-193377285045","labels":{"component":"kube-apiserver","tier":"control-plane"},"annotations":{"kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint":"10.0.13.130:6443","kubernetes.io/config.hash":"edadeb83c3d11c3a33abe5011f6cc5d4","kubernetes.io/config.mirror":"edadeb83c3d11c3a33abe5011f6cc5d4","kubernetes.io/config.seen":"2021-01-25T04:06:51.110073647Z","kubernetes.io/config.source":"file","kubernetes.io/psp":"wcp-privileged-psp"},"host":"42068944986ef297cc64a8a2a234c0dd","container_name":"kube-apiserver","docker_id":"c5d9a31d70ab45347fbc983aa8b854d99385791d28220557502382d0e0e3267c"}}]

What specifically would I need to look for in here to determine or validate if it is an issue with a .local?

Thanks for the tip on pidstat; a tool I had not yet found. I do see that it is caused by fluent-bit

0 Kudos
frozenak
Enthusiast
Enthusiast

Ok so not sure why grep didn't find it, but I found the error lines where it lists my vcenter. But, out of the 63,645 lines of log, vcenter is only found 683 times.

Spoiler
kube.var.log.containers.vsphere-csi-controller-7c646978db-dv8tn_vmware-system-csi_vsphere-syncer-117430760f80fea02e666c34a71f32c1f88887151b5e78639e9ff047a9e7b1b2.log: [1611621903.859503, {"log":"2021-01-26T00:45:03.859325213Z stderr F {\"level\":\"error\",\"time\":\"2021-01-26T00:45:03.859202328Z\",\"caller\":\"storagepool/intended_state.go:183\",\"msg\":\"Error fetching cluster properties for domain-c7. Err: Post \\\"https://vcenter.incendiary.local:443/sdk\\\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\",\"TraceId\":\"2ed71762-b82f-4cfc-ad35-d82394e9ab6e\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.isRemoteVsan\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/intended_state.go:183\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.newIntendedState\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/intended...
kube.var.log.containers.vsphere-csi-controller-7c646978db-dv8tn_vmware-system-csi_vsphere-syncer-117430760f80fea02e666c34a71f32c1f88887151b5e78639e9ff047a9e7b1b2.log: [1611621903.859504, {"log":"2021-01-26T00:45:03.859427354Z stderr F {\"level\":\"error\",\"time\":\"2021-01-26T00:45:03.859363765Z\",\"caller\":\"storagepool/intended_state.go:148\",\"msg\":\"Not able to determine whether Datastore is vSAN Remote. Post \\\"https://vcenter.incendiary.local:443/sdk\\\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\",\"TraceId\":\"2ed71762-b82f-4cfc-ad35-d82394e9ab6e\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.newIntendedState\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/intended_state.go:148\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.ReconcileAllStoragePools\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storag...
kube.var.log.containers.vsphere-csi-controller-7c646978db-dv8tn_vmware-system-csi_vsphere-syncer-117430760f80fea02e666c34a71f32c1f88887151b5e78639e9ff047a9e7b1b2.log: [1611621903.860196, {"log":"2021-01-26T00:45:03.859654515Z stderr F {\"level\":\"error\",\"time\":\"2021-01-26T00:45:03.859446348Z\",\"caller\":\"storagepool/listener.go:226\",\"msg\":\"Error reconciling StoragePool for datastore datastore-11. Err: Post \\\"https://vcenter.incendiary.local:443/sdk\\\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\",\"TraceId\":\"2ed71762-b82f-4cfc-ad35-d82394e9ab6e\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.ReconcileAllStoragePools\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:226\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.scheduleReconcileAllStoragePools.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/...
kube.var.log.containers.coredns-5dc469f6f-8bdfl_kube-system_coredns-52db7b11eb2e6cb606bdf3f66e11c8fa76ba9def1fa58a030927e6a7df9bb7c2.log: [1611621914.930205, {"log":"2021-01-26T00:45:14.930134451Z stdout F [INFO] 10.0.11.129:36657 - 45778 \"A IN vcenter.incendiary.local.vmware-system-kubeimage.svc.cluster.local. udp 84 false 512\" NXDOMAIN qr,aa,rd 177 0.000170379s","kubernetes":{"pod_name":"coredns-5dc469f6f-8bdfl","namespace_name":"kube-system","pod_id":"7d43881e-8f67-4958-a628-b3ad9ee2f3ba","labels":{"k8s-app":"kube-dns","pod-template-hash":"5dc469f6f"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420631715fd538d484265a9a44e451cb","container_name":"coredns","docker_id":"52db7b11eb2e6cb606bdf3f66e11c8fa76ba9def1fa58a030927e6a7df9bb7c2","container_hash":"localhost:5000/vmware/coredns@sha256:000746b5e05267b0f7b61066ee559e07be352849a8f05d801a61334731cf567d","container_image":"localhost:5000/vmware/coredns:v1.18"}}]

Then pulled the /var/log/pods/vmware-system-csi_vsphere-csi-controller-7c646978db-dv8tn_4b4f30e2-85d2-4d2a-8d39-908f1127d51a/vsphere-syncer/3.log

Spoiler
2021-01-25T05:03:54.190144872Z stderr F {"level":"info","time":"2021-01-25T05:03:54.190078459Z","caller":"storagepool/intended_state.go:292","msg":"Updating StoragePool instance for storagepool-vsandatastore","TraceId":"cfa9f173-96d9-4394-bc29-91acad1a7fca"}
2021-01-25T05:03:55.01110124Z stderr F {"level":"error","time":"2021-01-25T05:03:55.010962349Z","caller":"vsphere/hostsystem.go:56","msg":"failed to get datastore managed objects from datastore objects [Datastore:datastore-6955 Datastore:datastore-1074 Datastore:datastore-1093 Datastore:datastore-1075 Datastore:datastore-1073 Datastore:datastore-780 Datastore:datastore-11] with properties [info]: Post \"https://vcenter.incendiary.local:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","TraceId":"fbdb1f9f-52c8-4ba8-b1bc-dc2b894dc108","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*HostSystem).GetAllAccessibleDatastores\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/hostsystem.go:56\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).GetDatastoresByCluster\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:431\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.getVsanDirectDatastores\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/utils.go:318\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.GetCandidateDatastoresInCluster\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/utils.go:258\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.ReconcileAllStoragePools\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:213\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.scheduleReconcileAllStoragePools.func1\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:180"}
2021-01-25T05:03:55.011240901Z stderr F {"level":"error","time":"2021-01-25T05:03:55.011162316Z","caller":"vsphere/virtualcenter.go:433","msg":"Failed to fetch datastores from host HostSystem:host-775. Err: Post \"https://vcenter.incendiary.local:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","TraceId":"fbdb1f9f-52c8-4ba8-b1bc-dc2b894dc108","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).GetDatastoresByCluster\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:433\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.getVsanDirectDatastores\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/utils.go:318\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.GetCandidateDatastoresInCluster\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/utils.go:258\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.ReconcileAllStoragePools\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:213\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.scheduleReconcileAllStoragePools.func1\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:180"}
2021-01-25T05:03:55.011294472Z stderr F {"level":"warn","time":"2021-01-25T05:03:55.011260791Z","caller":"vsphere/utils.go:320","msg":"Not able to fetch datastores in cluster domain-c7. Err: Post \"https://vcenter.incendiary.local:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","TraceId":"fbdb1f9f-52c8-4ba8-b1bc-dc2b894dc108"}
2021-01-25T05:03:55.011364281Z stderr F {"level":"error","time":"2021-01-25T05:03:55.011322963Z","caller":"storagepool/listener.go:215","msg":"Failed to find datastores from VC. Err: Failed to get vSAN Direct VMFS datastores. Err: Post \"https://vcenter.incendiary.local:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","TraceId":"fbdb1f9f-52c8-4ba8-b1bc-dc2b894dc108","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.ReconcileAllStoragePools\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:215\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.scheduleReconcileAllStoragePools.func1\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:180"}
2021-01-25T05:03:55.011520088Z stderr F {"level":"error","time":"2021-01-25T05:03:55.011472464Z","caller":"storagepool/listener.go:182","msg":"[iteration-1] Error reconciling StoragePool instances in HostMount listener. Err: Failed to get vSAN Direct VMFS datastores. Err: Post \"https://vcenter.incendiary.local:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","TraceId":"fbdb1f9f-52c8-4ba8-b1bc-dc2b894dc108","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.scheduleReconcileAllStoragePools.func1\n\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:182"}

It's getting an error on post.

If I curl on URL without HTTPS it errors with recv failure. If I CURL with HTTPS it returns nothing.

root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# curl vcenter.incendiary.local:443/sdk
curl: (56) Recv failure: Connection reset by peer
root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# curl https://vcenter.incendiary.local:443/sdk
root@42068944986ef297cc64a8a2a234c0dd [ ~ ]#

Yet it does still find the datastores.

Spoiler
2021-01-25T05:05:01.073693895Z stderr F {"level":"info","time":"2021-01-25T05:05:01.073619216Z","caller":"storagepool/util.go:86","msg":"Datastore NFS_WINDOWS properties: {NFS_WINDOWS ds:///vmfs/volumes/5c8f5271-184b8dc3/ cns.vmware.com/NFS false true 0xc0004e30c0 0xc0004e3140}","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.317564633Z stderr F {"level":"info","time":"2021-01-25T05:05:02.317478655Z","caller":"kubernetes/kubernetes.go:83","msg":"k8s client using in-cluster config","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.317816514Z stderr F {"level":"info","time":"2021-01-25T05:05:02.317768568Z","caller":"kubernetes/kubernetes.go:267","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.322251618Z stderr F {"level":"info","time":"2021-01-25T05:05:02.322181167Z","caller":"storagepool/util.go:148","msg":"Accessible nodes in MM for datastore datastore-1073: map[]","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.587407028Z stderr F {"level":"info","time":"2021-01-25T05:05:02.587303812Z","caller":"storagepool/intended_state.go:292","msg":"Updating StoragePool instance for storagepool-nfs-windows","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.862846381Z stderr F {"level":"info","time":"2021-01-25T05:05:02.862768973Z","caller":"storagepool/util.go:86","msg":"Datastore NFS_VMWARE properties: {NFS_VMWARE ds:///vmfs/volumes/42a229b4-23df8e96/ cns.vmware.com/NFS false true 0xc000c0db80 0xc000c0dc80}","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.883707579Z stderr F {"level":"info","time":"2021-01-25T05:05:02.88365006Z","caller":"kubernetes/kubernetes.go:83","msg":"k8s client using in-cluster config","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.88388984Z stderr F {"level":"info","time":"2021-01-25T05:05:02.883833708Z","caller":"kubernetes/kubernetes.go:267","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.888059201Z stderr F {"level":"info","time":"2021-01-25T05:05:02.887992091Z","caller":"storagepool/util.go:148","msg":"Accessible nodes in MM for datastore datastore-1074: map[]","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:02.904224787Z stderr F {"level":"info","time":"2021-01-25T05:05:02.904142311Z","caller":"storagepool/intended_state.go:292","msg":"Updating StoragePool instance for storagepool-nfs-vmware","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:03.178846694Z stderr F {"level":"info","time":"2021-01-25T05:05:03.178771811Z","caller":"storagepool/util.go:86","msg":"Datastore NFS_OTHEROS properties: {NFS_OTHEROS ds:///vmfs/volumes/58f61d0f-d0e45dee/ cns.vmware.com/NFS false true 0xc000c78640 0xc000c78680}","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:03.885254867Z stderr F {"level":"warn","time":"2021-01-25T05:05:03.88518438Z","caller":"syncer/metadatasyncer.go:1147","msg":"CSI migration feature state is disabled","TraceId":"d8814d19-9ebf-4965-9043-918ddb9dc3fd"}
2021-01-25T05:05:04.431081848Z stderr F {"level":"info","time":"2021-01-25T05:05:04.43102339Z","caller":"storagepool/listener.go:108","msg":"Got 3 property collector update(s)","TraceId":"95339f37-12c8-47af-bffe-b86976bc80de"}
2021-01-25T05:05:04.433706066Z stderr F {"level":"info","time":"2021-01-25T05:05:04.433660506Z","caller":"storagepool/intended_state.go:292","msg":"Updating StoragePool instance for storagepool-vsandatastore","TraceId":"95339f37-12c8-47af-bffe-b86976bc80de"}
2021-01-25T05:05:04.706783526Z stderr F {"level":"info","time":"2021-01-25T05:05:04.706723598Z","caller":"storagepool/listener.go:108","msg":"Got 3 property collector update(s)","TraceId":"785e5535-5210-494f-82a1-28d83cd782cd"}
2021-01-25T05:05:04.709147576Z stderr F {"level":"info","time":"2021-01-25T05:05:04.709096523Z","caller":"storagepool/intended_state.go:292","msg":"Updating StoragePool instance for storagepool-vsandatastore","TraceId":"785e5535-5210-494f-82a1-28d83cd782cd"}
2021-01-25T05:05:06.29654173Z stderr F {"level":"info","time":"2021-01-25T05:05:06.29567661Z","caller":"kubernetes/kubernetes.go:83","msg":"k8s client using in-cluster config","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:06.296627501Z stderr F {"level":"info","time":"2021-01-25T05:05:06.29579805Z","caller":"kubernetes/kubernetes.go:267","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:06.299606032Z stderr F {"level":"info","time":"2021-01-25T05:05:06.299400957Z","caller":"storagepool/util.go:148","msg":"Accessible nodes in MM for datastore datastore-1075: map[]","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:06.314448941Z stderr F {"level":"info","time":"2021-01-25T05:05:06.314064802Z","caller":"storagepool/intended_state.go:292","msg":"Updating StoragePool instance for storagepool-nfs-otheros","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:06.594879408Z stderr F {"level":"info","time":"2021-01-25T05:05:06.594801767Z","caller":"storagepool/util.go:86","msg":"Datastore ESX_LOGS properties: {ESX_LOGS ds:///vmfs/volumes/4b2689f1-516b524c/ cns.vmware.com/NFS false true 0xc0001d1740 0xc0001d1bc0}","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:13.131056131Z stderr F {"level":"info","time":"2021-01-25T05:05:13.130985394Z","caller":"kubernetes/kubernetes.go:83","msg":"k8s client using in-cluster config","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:13.13120221Z stderr F {"level":"info","time":"2021-01-25T05:05:13.131163088Z","caller":"kubernetes/kubernetes.go:267","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:13.135102161Z stderr F {"level":"info","time":"2021-01-25T05:05:13.135050764Z","caller":"storagepool/util.go:148","msg":"Accessible nodes in MM for datastore datastore-1093: map[]","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:13.151029814Z stderr F {"level":"info","time":"2021-01-25T05:05:13.150971316Z","caller":"storagepool/intended_state.go:292","msg":"Updating StoragePool instance for storagepool-esx-logs","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:13.422947705Z stderr F {"level":"info","time":"2021-01-25T05:05:13.422879682Z","caller":"storagepool/util.go:86","msg":"Datastore vsanDatastore properties: {vsanDatastore ds:///vmfs/volumes/vsan:520c4a3a23e9f88e-71a4ba7e3aff119d/ cns.vmware.com/vsan 520c4a3a23e9f88e-71a4ba7e3aff119d false true 0xc0012f5f80 0xc0012f5fc0}","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
2021-01-25T05:05:16.532840796Z stderr F {"level":"info","time":"2021-01-25T05:05:16.532778338Z","caller":"kubernetes/kubernetes.go:83","msg":"k8s client using in-cluster config","TraceId":"1b3fbeaf-65c2-494a-9a0d-34a489e4745b"}
0 Kudos
frozenak
Enthusiast
Enthusiast

Looks like /var/log/vmware/audit/kube-apiserver.log logs a lot too. 1GB in 6 hours, though ls doesn't display 3 of the files. Averaging 1 new log every 30-45 minutes, at 100MB each. Time provided by ls doesn't match up with actual log lines or date/time changed as displayed in WinSCP.
12:30-18:30 is 11 files @ 1.04GB

Spoiler
root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# ls -A /var/log/vmware/audit/ -s -h -l
total 714M
101M -rw-r--r-- 1 root root 100M Jan 25 07:56 kube-apiserver-2021-01-25T07-56-53.141.log
101M -rw-r--r-- 1 root root 100M Jan 25 11:08 kube-apiserver-2021-01-25T11-08-56.313.log
101M -rw-r--r-- 1 root root 100M Jan 25 14:16 kube-apiserver-2021-01-25T14-16-41.535.log
101M -rw-r--r-- 1 root root 100M Jan 25 17:22 kube-apiserver-2021-01-25T17-22-29.788.log
101M -rw-r--r-- 1 root root 100M Jan 25 20:27 kube-apiserver-2021-01-25T20-27-12.877.log
101M -rw-r--r-- 1 root root 100M Jan 25 23:32 kube-apiserver-2021-01-25T23-32-36.379.log
101M -rw-r--r-- 1 root root 100M Jan 26 02:37 kube-apiserver-2021-01-26T02-37-09.257.log
14M -rw-r--r-- 1 root root 14M Jan 26 03:01 kube-apiserver.log

Also seeing this every 1 minute in coredns log
/var/log/pods/kube-system_coredns-5dc469f6f-8bdfl_7d43881e-8f67-4958-a628-b3ad9ee2f3ba/coredns/6.log

Spoiler
2021-01-26T03:05:59.039903212Z stdout F [INFO] 10.0.11.129:56047 - 60281 "A IN vcenter.incendiary.local.vmware-system-kubeimage.svc.cluster.local. udp 84 false 512" NXDOMAIN qr,aa,rd 177 0.000139909s
2021-01-26T03:05:59.039919346Z stdout F [INFO] 10.0.11.129:39912 - 43457 "AAAA IN vcenter.incendiary.local.vmware-system-kubeimage.svc.cluster.local. udp 84 false 512" NXDOMAIN qr,aa,rd 177 0.000054804s
2021-01-26T03:05:59.040228698Z stdout F [INFO] 10.0.11.129:37068 - 3792 "A IN vcenter.incendiary.local.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000073066s
2021-01-26T03:05:59.040235743Z stdout F [INFO] 10.0.11.129:50835 - 18258 "AAAA IN vcenter.incendiary.local.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000041533s
2021-01-26T03:05:59.040570154Z stdout F [INFO] 10.0.11.129:51421 - 42079 "A IN vcenter.incendiary.local.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000044842s
2021-01-26T03:05:59.040580493Z stdout F [INFO] 10.0.11.129:54895 - 23787 "AAAA IN vcenter.incendiary.local.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000036734s

Seeing more bad certificate errors.
/var/log/pods/vmware-system-license-operator_vmware-system-license-operator-controller-manager-7d8c9679qkmmk_81254b45-d221-49ef-a4dd-ce2e15ad5f56/manager/0.log

Spoiler
2021-01-25T10:02:33.134858909Z stderr F 2021/01/25 10:02:33 http: TLS handshake error from 10.0.11.131:33250: remote error: tls: bad certificate
2021-01-25T10:02:33.233078918Z stderr F 2021/01/25 10:02:33 http: TLS handshake error from 10.0.13.131:16827: remote error: tls: bad certificate
2021-01-25T10:02:33.234260788Z stderr F 2021/01/25 10:02:33 http: TLS handshake error from 10.0.13.131:40001: remote error: tls: bad certificate
2021-01-25T10:02:33.234302377Z stderr F 2021/01/25 10:02:33 http: TLS handshake error from 10.0.13.131:54312: remote error: tls: bad certificate
2021-01-25T10:02:34.433934971Z stderr F 2021/01/25 10:02:34 http: TLS handshake error from 10.0.11.131:33266: remote error: tls: bad certificate

 

0 Kudos
depping
Leadership
Leadership

Let me raise this with a few folks, that sounds very extreme indeed.

0 Kudos
frozenak
Enthusiast
Enthusiast

Thank you. I would say 2.5TBW per 24 hours is a touch on the high side.

That is, if my maths are correct, it's 2.5TBW per day.

Would be interested to know if there is some sort of issues with .local domains.

In part, because of this VMware TAM Lab 067 video where they are using a .local domain.

https://www.youtube.com/watch?v=bQ4FcieZlZg

0 Kudos
depping
Leadership
Leadership

Agreed, engineering is looking into it.

0 Kudos
frozenak
Enthusiast
Enthusiast

Any updates? Any additional info I might be able to provide?

0 Kudos
depping
Leadership
Leadership

I have no update, other than it is being looked in to.

0 Kudos
depping
Leadership
Leadership

Issue has been confirmed in our engineering environment as well, it is being investigated and hopefully fixed in a future release. If you run this in production I would recommend filing an SR for it and submitting all details, that way it can be attached to the engineering issue.

frozenak
Enthusiast
Enthusiast

Good to hear it isn't something isolated to my environment. Thank you!
In my case this is just a homelab, so not a huge concern.

We do have intention of (eventually) deploying this at work though, which is also on a .local domain.
Not sure on that timeframe though, as we are still working to get the environment upgraded to 7.0U1.

Is the issue related to .local domains or something else?

0 Kudos
depping
Leadership
Leadership

No, it appears to be related to the log level for fluent. 

0 Kudos
FL0G
Enthusiast
Enthusiast

I had the same problem with the latest vSphere with Tanzu version (vCenter Server 7.0 U1d / v1.18.2-vsc0.0.7-17449972). I solved the problem by changing the fluentd loglevel from debug to warn in the Supervisor Control Plane. It was spamming debug messages:

[debug] [input:tail:tail.0] inode=526818 events: IN_MODIFY

See this for details:

https://www.virten.net/2021/02/vsphere-with-tanzu-supervisorcontrolplanevm-excessive-disk-write-io/

 

depping
Leadership
Leadership

Yes, that is indeed the way to get around the problem.

0 Kudos
Joffer77
Contributor
Contributor

Just found this on the problem: https://www.virten.net/2021/02/vsphere-with-tanzu-supervisorcontrolplanevm-excessive-disk-write-io/

I checked my system and we're having 8000IOPS from the supervisorVMs.. will create an SR instead of fixing it myself according to the blog.

0 Kudos