Guest Introspection Service Not Ready

epa80 · ‎09-20-2018

We have a VDI environment that uses Guest Introspection with Deep Security from Trend Micro. Ever since our upgrade to vCenter/ESXi 6.5 and NSX 6.4.1, the GI service randomly gets into a state where it's not ready. The specific alarm is "Guest Introspection Service Not Ready". We were used to seeing it before with ESXi 6.0 and NSX 6.4.0, where it was more of a missed heartbeat or something and would quickly go away, but in this scenario it sticks around. Not only that, it seems to affect performance. We were actually getting calls that our VDI environment was sluggish, and it could only be resolved by shutting down the GI appliances. The resources on the hosts, though, were quite green, not even close to maxing out memory/CPU.

We opened a ticket with support, and eventually we received a reply of:

The observed symptoms in the environment and logged messages are similar to this KB. We recommend upgrading NSX to 6.4.2. If you want to implement the workaround, you need to follow these steps.

Attached file namespace_db.py

1) scp/transfer the script namespace_db.py to each host /tmp directory

1a) Migrate VMs off from the host

2) do 'cp /usr/lib/vmware/vsepmux/bin/namespace_db.py /usr/lib/vmware/vsepmux/bin/namespace_db_py.bak'

3) do 'rm /usr/lib/vmware/vsepmux/bin/namespace_db.py'

4) cp /tmp/namespace_db.py /usr/lib/vmware/vsepmux/bin/

5) check the status and restart the service

/etc/init.d/vShield-Endpoint-Mux status

/etc/init.d/vShield-Endpoint-Mux restart

/etc/init.d/vShield-Endpoint-Mux status

6) Please do steps 1 to 5 on each ESXi host
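The steps above can be sketched as a small helper that only builds the command sequence for one host. This is a dry-run sketch, not something support provided: the function name `workaround_commands` is hypothetical, and nothing is executed or copied. The paths and the init script are the ones quoted in the support reply.

```python
# Paths and the init script are taken from the support reply above.
MUX_DIR = "/usr/lib/vmware/vsepmux/bin"
MUX_INIT = "/etc/init.d/vShield-Endpoint-Mux"


def workaround_commands(staged_script="/tmp/namespace_db.py"):
    """Return the shell commands for steps 2 to 5, in order (hypothetical helper).

    Nothing is executed here; review the list, then run it on a host that
    has already been evacuated (step 1a).
    """
    target = f"{MUX_DIR}/namespace_db.py"
    backup = f"{MUX_DIR}/namespace_db_py.bak"
    return [
        f"cp {target} {backup}",           # step 2: keep a copy of the original
        f"rm {target}",                    # step 3: remove the original
        f"cp {staged_script} {MUX_DIR}/",  # step 4: install the replacement
        f"{MUX_INIT} status",              # step 5: check, restart, check again
        f"{MUX_INIT} restart",
        f"{MUX_INIT} status",
    ]


if __name__ == "__main__":
    for command in workaround_commands():
        print(command)
```

Printing the list first makes it easy to compare against the support instructions before touching a host.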

Well, we eventually did what the KB asked and upgraded NSX to 6.4.2. The issue followed. We then saw that VMware quickly released NSX 6.4.3. Gave that a shot, the issue remained. We never did try the workaround steps as I preferred not to. I tried upgrading the appliances from 6.4.2 to 6.4.3, as well as removing the service completely and re-deploying it, no luck.

The only trigger I can see is when we put a load on the hosts. We're deploying a linked clone Horizon environment to these clusters. When I deploy the GI service and the hosts are empty, all is green. As soon as I start to deploy the Horizon environment, the hosts start to see the alarm. Funnily enough, the alarm bounces between hosts. We see hosts that have fewer VMs than others get it, but then an hour later the alarm could move to the other hosts and clear from the prior ones. It's inconsistent.

I have a cluster right now seeing the issue. 6 hosts, Dell R730XDs, vSAN, about 400 VMs. Right now the alarm is showing on host 1 (90 VMs) and host 6 (72 VMs). It's not showing on host 2 (48), host 3 (72), host 4 (64), or host 5 (69).

Well on 2nd viewing there, I guess technically the top 2 counts are seeing it. But if that's the issue, and 70 VMs per host is the max, ugh. The VMs BTW have 2 vCPUs and 3GB of memory.
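As a quick sanity check on the per-host numbers quoted above, a few lines of Python (variable names are illustrative only) show that VM count alone does not cleanly separate the alarmed hosts from the quiet ones, since host 3 has the same count as host 6:

```python
# VM counts per host and current alarm state, as reported in the post above.
vm_counts = {1: 90, 2: 48, 3: 72, 4: 64, 5: 69, 6: 72}
alarmed = {1, 6}

total = sum(vm_counts.values())
# Largest VM count on a host that is NOT showing the alarm.
highest_quiet = max(n for host, n in vm_counts.items() if host not in alarmed)
# Smallest VM count on a host that IS showing the alarm.
lowest_alarmed = min(vm_counts[host] for host in alarmed)

print("total VMs:", total)
print("highest count without alarm:", highest_quiet)
print("lowest count with alarm:", lowest_alarmed)
# A clean per-host threshold would need lowest_alarmed > highest_quiet.
print("clean threshold exists:", lowest_alarmed > highest_quiet)
```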

Would love to hear if anyone has seen this before. If I can answer any questions or clarify anything, I'd be glad to. Thanks in advance.

ATFink · ‎09-21-2018

epa80,

At this point, can you confirm whether you are still encountering performance issues or loss of AV protection at the time of the error message?

epa80 · ‎09-21-2018

ATFink,

We have a cluster in this data center that is all brand new Dell model R740xd. There are 42 hosts in total, and on those hosts, I cannot replicate the issue. It only seems to be on the r730 hosts. That being said, performance and protection on the 740 cluster seems normal. No alerts, no performance issues seen or reported.

For the r730 clusters, the AV protection from Deep Security is saying things are managed/normal. When the NSX GI issue is happening, I don't see an issue reflected in Deep Security at the same time (or at all really). In terms of performance, it's hard to say because we currently don't have users pointed at this cluster. Prior though, in a different cluster (730s again) where the identical issue was happening, indeed the users complained of sluggishness that could only be resolved by shutting down the GI appliances.

One thing to note: my 740 clusters did experience the issues at one point, but, I believe it may have been resolved by fixing an NTP setting on the NSX Managers. I haven't seen the issue there since applying the fix. And yes the setting is correct on the 730 clusters.

epa80 · ‎09-21-2018

Just double checked and yes, my Deep Security environment is showing no ill effects. If you looked only in that console, you'd never know there's an issue with the NSX Guest Introspection appliances. Green across the board.

sveld · ‎11-08-2018

Hi,

I have a similar issue after following a similar upgrade route; did you find a solution yet? I've opened a case for this with VMware and Trend Micro, but no solution so far.

Can you confirm whether you see the following messages in /var/log/syslog.log on affected hosts?

--------------/var/log/syslog.log------------

2018-11-05T06:26:56Z EPSecMux[68899]: [ERROR] (EPSEC) [68912] popen failed for pidof NSX-Context-Engine. Errno: 28 (errno is not set on allocation failures). Strerror: No space left on device.

2018-11-05T06:26:56Z EPSecMux[68899]: [ERROR] (EPSEC) [68912] Exception encountered while querying Context Engine status, EPSecPosixError@tid=68912: popen failed. errno: 28 (No space left on device)

--------------end /var/log/syslog.log------------
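To check a host for these entries, a minimal sketch along the following lines could be used. It assumes each syslog entry sits on a single line in the actual file (the wrapping above comes from the forum post), and the helper name `find_popen_failures` is hypothetical. The pattern only targets the two message shapes quoted above.

```python
import re

# Matches the EPSecMux lines quoted above: an [ERROR] entry whose message
# reports "popen failed" together with errno 28.
PATTERN = re.compile(
    r"^(?P<timestamp>\S+) EPSecMux\[\d+\]: \[ERROR\].*popen failed.*[Ee]rrno: 28"
)


def find_popen_failures(lines):
    """Return the timestamps of EPSecMux popen/errno 28 errors (hypothetical helper)."""
    hits = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            hits.append(match.group("timestamp"))
    return hits


if __name__ == "__main__":
    # Sample input: the two entries from the post, each joined onto one line.
    sample = [
        "2018-11-05T06:26:56Z EPSecMux[68899]: [ERROR] (EPSEC) [68912] popen failed "
        "for pidof NSX-Context-Engine. Errno: 28 (errno is not set on allocation "
        "failures). Strerror: No space left on device.",
        "2018-11-05T06:26:56Z EPSecMux[68899]: [ERROR] (EPSEC) [68912] Exception "
        "encountered while querying Context Engine status, "
        "EPSecPosixError@tid=68912: popen failed. errno: 28 (No space left on device)",
    ]
    print(find_popen_failures(sample))
```

To run it against a real log, pass an open file object for a copy of /var/log/syslog.log instead of the sample list.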

If I run into this issue on a specific host and restart /etc/init.d/vShield-Endpoint-Mux, that solves the introspection problem and these messages are no longer logged in syslog.log.

Thx, Sebastiaan

Best regards, Sebastiaan Veld

epa80 · ‎11-12-2018

The answer at this point from support has been that the "GI service is not ready" message is cosmetic. We were provided this KB as a point of reference:

https://kb.vmware.com/s/article/58845

Apparently it's something that will be fixed in 6.4.4. I am still skeptical, as we have also continued to see issues with Trend Micro. We're awaiting the 6.5 Update 2 Patch 3 release this month, and after that I'd like to see what we get. It's supposed to resolve a heap memory issue in our ESXi version, and I'd love to see if it relates to this.

sveld · ‎11-14-2018

Thx, I found that article too, but that is not the issue we are running into here. Support is still looking into the logs. I'll keep you posted in case anything interesting pops up.

Best regards, Sebastiaan Veld

sveld · ‎11-19-2018

Hi @epa80,

I just got confirmation from VMware that this issue is supposed to be solved in NSX 6.4.4, so fingers crossed. In the meantime I'll restart the MUX driver on a regular basis (once a week).

-Sebas

Best regards, Sebastiaan Veld

epa80 · ‎12-17-2018

Looks like 6.4.4 has landed. Resolved issues below. The first bullet sounds good to me.

Resolved Issues

The resolved issues are grouped as follows.

General Resolved Issues

Fixed Issue 2089858: GI SVM memory limits with high network throughput. High memory observed in the GI SVM. Loss of connectivity to the NSX Manager may also occur. Customer workloads running on affected hosts may be impacted.
Fixed Issue 2094345: Purple screen on host when flow collection is turned on in NSX. Host crashed with purple screen, leading to loss of VMs data.
Fixed Issue 2177097: When using API call /api/2.0/vdn/config/segments to create a pool with 1 Segment ID it fails with, "Segment id is out of range, valid range is 5000-16777215". When using the API /api/2.0/vdn/config/segments, if you provide the same start and end value when creating a single value segment, it fails with an error.
Fixed Issue 2178339: rsyslog 8.15.0-7.ph1 removed ExecReload line in systemd service file causing /var/log/syslog and /var/log/messages to not logrotate properly. This causes /var/log partition to take up 100% disk space so new logs cannot be written.
Fixed Issue 2188753: Inventory sync results in “duplicate key value violates unique constraint” exception when multiple mappings exist for vNIC in domain_object_relationships table. Inventory sync keeps failing, resulting in vCenter and NSX becoming out of sync.
Fixed Issue 2194374: GI USVM SSH not working. Some ssh keys, such as RSA, DSA, ECDSA and ED25519 are not generated automatically.
Fixed Issue 2134192: Error is thrown when there is no port on the switch. If a physical switch on a hardware gateway has no port, NSX throws an error when trying to get the port from the switch. You will see the error, "Unable to fetch inventory information" while trying to get the port.
Fixed Issue 2183584: Security groups created within an Application Rule Manager session are not displayed in the drop down menu of the applied-to column of the recommended rule. Security groups created within an Application Rule Manager session are not displayed in the drop down menu of the applied-to column of the recommended rule.
Fixed Issue 2210313: In “Force sync” workflow, security policy inheritance is not considered. If a Security policy is inherited by another child policy, after force sync, the child policy's applied SGs are not considered as Policy security groups for parent policy. Firewall rules from parent policy are not correctly applied to PSGs of child policy.
Fixed Issue 2216582: Some VMs lose AV protection. Due to high memory usage and changing VM UUID, unable to apply new config for changed VM. Some VMs lose AV protection.
Fixed Issue 2195346: VDR Instances are erased from hosts of Secondary Manager after a delete and add of controller cluster. Traffic loss of around 40 seconds after the delete and add of controller cluster.
Fixed Issue 2213199: VM page under logical switch is not loading. User interface looks frozen. All the VMs for a logical switch cannot be viewed.
Fixed Issue 2172254: When only bridging is configured, Active-Active is reported during redeploy of an active appliance. The same IP address might be reported by two appliances (active-active scenario). However, if even a single uplink is configured, or routing is configured on DLR, then this issue will not be seen. Traffic drop and active-active scenario will be encountered if no uplinks are configured, dynamic routing is not enabled on DLR and only L2 bridge is configured.
Fixed Issue 2018917: High count of NSX backup files causes unpredictable behavior. High count of NSX backup files causes unpredictable behavior, including NSX deployment failures, unresponsive user interface.

Logical Networking and NSX Edge Resolved Issues

Fixed Issue 2158380: SSL VPN DNS suffix changes not reverted after logout. DNS suffix changes on adapter are not reverted back after sslvpn client logs out. DNS resolution may not work as expected.
Fixed Issue 2177891: Nexthop learned from Edge BGP may not be as expected after an Edge failure is recovered. Nexthop learned from Edge BGP may not be as expected after a Edge failure is recovered. There may be packet loss depending on the customer topology.
Fixed Issue 2178771: SSLVPN interface na0 losing gateway IP address. SSLVPN na0 interface, which acts as a gateway for clients connected, loses its ip address, causing traffic disruption from client. Client traffic does not flow as expected.
Fixed Issue 2192834: No solution provided when trying to delete an appliance configuration without using deployAppliance flag. Error message is not clear and does not provide a solution.
Fixed Issue 2126743: IP address of VMs not published to addrset when VMs added to security group. When a VM is added to securitygroup, the IP of the VM doesn't show up in the addrset. Traffic fails, as the proper policy is not applied to the VM, even if it is part of a properly-configured security group.
Fixed Issue 2197442: BGP neighbor password of UDLR is not getting replicated to secondary NSX Manager. Routes are not learned from the configured neighbor.
Fixed Issue 2209469: Section level create/update operation does not publish to Edge. Section update succeeds for Edge but rules not updated in Dataplane.
Fixed Issue 2207483: High latency for both E-W and N-S routed traffic. TxWorld of VM generating routed traffic takes 100% CPU resulting in high latency.
Fixed Issue 2192486: Multicast south to north traffic forwarding will stop after upgrading from NSX 6.4.2 to 6.4.3 if underlying unicast routing is through static routes and HA mode is disabled. If you are running multicast streams between source inside NSX and receivers outside NSX and the underlying unicast routing is through static routes, and you upgrade from NSX 6.4.2 to 6.4.3, the traffic forwarding will stop for the existing multicast streams. Multicast streams created after the upgrade are not affected and are forwarded.
Fixed Issue 2185738: Unable to use an interface if it has an IPv6 address. Internal interfaces that have an IPv4 address are listed, but if they also have an IPv6 address, an error is generated. DNS Forwarder Configuration cannot be applied to an interface containing an IPv6 Address.
Fixed Issue 2215061: NSX Edge losing HA state after DCN publish as same ApplianceConfig sent to both VMs when load balancer is configured with grouping objects and HA is enabled. HA is disconnected, leading to split brain.
Fixed Issue 2220327: Unable to set system managed resource reservation for an Edge. Choosing custom resource reservation for Edge, System Managed Resource Reservation is not displayed on user interface menu.
Fixed Issue 2220549: Firewall rule validation consuming grouping objects takes time leading to user interface timeout. Edge Firewall configuration change where firewall rules consuming grouping objects have been configured takes time in excess of 2 minutes leading to UI timeout.

NSX Manager Resolved Issues

Fixed Issue 2220325: Tech support bundle contains files with plain-text passwords. Password for postgres, rabbitmq are available in plain-text config files of tech support bundle and can be retrieved.
Fixed Issue 2208178: After NSX Manager reboot, NSX VIBs installation task is shown as continuously running in the UI on the Host Preparation tab. EAM does not start the installation of NSX VIBs.
