msiem's Posts

Greetings,

We recently upgraded from 6.7 to 7.0.3 (build 19482537), and we had never had similar problems with vMotion before. When a network failure affects ESXi hosts, they return to normal as soon as the Cisco ports, or the entire network environment, re-converge. Yesterday, after such a network incident, we were unable to vMotion several VMs onto two ESXi hosts. I looked through the vpxa, hostd, and vmkernel logs and found:

Failed waiting for data. Error 195887167. Connection closed by remote host, possibly due to timeout.
VMotionStream [-1407778881:4151649780786036937] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
cpu34:2591196)WARNING: Migrate: 6460: 4151649780786036937 Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

There are also a lot of entries like:

Cannot open file "/vmfs/volumes/5ed12ccc-e4651386-16a9-bc97e148c8ec/VMXXX/VMXXX.vmx": Device or resource busy

or:

il3: 4994: Lock failed on file: VMXXX.vmx on vol 'ST0CML1-VMFS2' with FD: <FD c57 r1>

Based on some Cisco log entries, I decided to replace the SFP modules in one ESXi host (and the corresponding module on the Cisco side), but I still could not vMotion any VMs. The only workaround seems to be a reboot: after the reboot, the vMotion problems are gone, and not a single VM gets stuck at 20% while being moved to another host. This suggests there are no configuration problems (MTU mismatch, etc.). At the moment a reboot is the only workaround - maybe there's a bug in 7.0.3?

Regards,
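For anyone hitting the same "Lock failed on file" symptom: before rebooting, it may be worth checking who holds the stale lock and trying a management-agent restart first. A hedged sketch for the ESXi shell (over SSH); the datastore UUID and VM name below are just the ones from my log excerpt, adjust them to your environment:

```shell
# Show the lock mode and the MAC address of the lock owner for the .vmx.
# An owner MAC of a remote host means another ESXi host still holds the
# lock; all zeros usually means the lock is local.
vmkfstools -D "/vmfs/volumes/5ed12ccc-e4651386-16a9-bc97e148c8ec/VMXXX/VMXXX.vmx"

# Restart the management agents on the affected host; in some cases this
# clears stuck vMotion/lock state without a full reboot.
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```

This is only a diagnostic/workaround sketch, not a fix for the underlying 7.0.3 behavior.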
Greetings,

I've been reading about NFS & VMware best practices, but I still find some aspects confusing, especially regarding network configuration. We use iSCSI on a daily basis, but there's an opportunity to back up old, deprecated VMs (plenty of them) to a Synology NAS over NFS. I've already read the following thread here, and the article on why to use a vmkernel port with NFS, but it still seems as if it were not mandatory - only a matter of shortening the path and avoiding pushing the traffic through a router.

1. We use separate vmks for Mgmt and vMotion (vSwitch0) and two iSCSI fault domains (vSwitch1 & vSwitch2), and they all share the default TCP/IP stack. The default stack has a vmkernel gateway set, along with DNS servers. If so: can I assign the default TCP/IP stack to a new vmk on a new, NAS-dedicated network, but without setting a default gateway for it? Since the default stack is assigned external DNS servers, theoretically I should be able to reach the external NAS from the new vmk the same way vmk0 (Mgmt) communicates with the DNS servers (different subnets), even though the default stack has no default gateway in the same subnet as the NAS.

2. Is creating a new (custom) TCP/IP stack and a new vmk for NAS the only right way to attach NAS storage to an ESXi host?

3. (An abstract example:) If I can communicate with server B from server A and they are in different subnets (provided the routing is set), can I theoretically communicate with the NAS server using vmk0 (Mgmt) only? It's essentially the same situation, provided the routing between the Mgmt VLAN and the NAS VLAN is set, and let's say I don't care about routing congestion.

Regards,
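In case it helps frame question 2: here is a hedged sketch of what the custom-stack approach could look like from the ESXi shell. Every name and address below (netstack "nas", port group "PG-NAS", vmk5, the IPs, the Synology export path) is a made-up placeholder, not something from our environment:

```shell
# Hypothetical example: a dedicated TCP/IP stack and vmkernel port for NFS,
# so NAS traffic gets its own routing table, separate from Mgmt/vMotion/iSCSI.

# 1. Create a custom netstack.
esxcli network ip netstack add -N nas

# 2. Create a vmkernel interface on that stack, bound to an existing
#    port group (PG-NAS) carrying the NAS VLAN.
esxcli network ip interface add -i vmk5 -p PG-NAS -N nas

# 3. Give the new vmk a static address in the NAS subnet.
esxcli network ip interface ipv4 set -i vmk5 -t static -I 192.168.50.10 -N 255.255.255.0

# 4. Mount the Synology NFS export as a datastore.
esxcli storage nfs add -H 192.168.50.20 -s /volume1/vm-backup -v synology-backup
```

Whether this is the "only right way" is exactly my question - the same mount also works over the default stack if routing allows it; the custom stack just keeps the gateway and traffic isolated.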
Good morning to all of you,

We're finishing the configuration of our new production VMware 6.7-based cluster; however, we're experiencing a rather strange issue with Alarms and, in particular, their reset rules. We don't get email notifications when the 'Network connectivity lost' error is normalized and turns Green after we get an ESXi host back online. The same applies when we shut down one port of a particular port-channel group (so that network uplink redundancy is lost). We get email notifications informing us that network connectivity or uplink redundancy is lost, but our reset rules don't seem to work.

1. We verified whether some emails were blocked on the mail server, but according to the people responsible for that service, that's not the case.
2. I found this article and thought there might be an issue with the notification service in general.
3. The alarm rules themselves work just fine - we get email notifications (as written above).
4. We didn't change anything in the 'Network connectivity lost' reset rules except enabling SNMP traps and the 'Send email notifications' option and setting the proper email address.

When we go to a particular host and look at the Events, we see the following set of events:

So there's no email notification after 'Network uplink redundancy lost' changes from Red to Green. We're wondering if this might have to do with the notification service itself.

[UPDATE, July 7th] Today I carried out another test, and something really weird happened.
1. I shut down one port of the port-channel to trigger the 'redundancy lost' alarm, and I got the notification - great!
2. I restored the previous status of that switch port so that redundancy was recovered; the notification was cleared in vCenter, but there was no email notification that the status had changed from Red to Green.
3. I decided to watch hostd.log in 'live' mode to see what was (and is) happening.
4. At that point both vmnic0 and vmnic4 were online (the port-channel was fully operational).
5. Then I noticed (in the hostd.log file) that ESXi was - BY TURNS - generating the events esx.problem.net.redundancy.lost and esx.clear.net.redundancy.restored, like an infinite loop. Those log entries were being generated over and over again, right up until I restarted the management agents. Along the way we got two additional email notifications informing us that redundancy was lost, even though the port-channel was already up and running. I'm pretty sure it should not work like that.

[UPDATE, July 9th] Today I carried out another test against one of our ESXi hosts, and when I shut down one port in the port-channel, not a single alarm email was sent - no 'lost' event, no 'restored' event (only Dell-related events appeared in vCenter: link down / link up). The other ESXi host behaved better in this respect, but there was no email message after restoring redundancy. I've collected some log files and made some screenshots, and I think this post is clear enough not to require further explanation. I think the Support team should look into this.

Message was edited by: Marcin Siemion
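For anyone who wants to reproduce the check from step 5: the flapping can be confirmed by grepping hostd.log for the two event IDs. A small self-contained sketch below runs against a made-up log excerpt so it can be tried anywhere; on a live host you would instead point it at /var/log/hostd.log (or use tail -f piped into the same grep pattern):

```shell
# Count the alternating redundancy events. LOG is a fabricated sample
# excerpt here; on an ESXi host, set LOG=/var/log/hostd.log instead.
LOG="$(mktemp)"
cat > "$LOG" <<'EOF'
2022-07-07T09:01:02Z hostd: Event 101 : esx.problem.net.redundancy.lost
2022-07-07T09:01:04Z hostd: Event 102 : esx.clear.net.redundancy.restored
2022-07-07T09:01:06Z hostd: Event 103 : esx.problem.net.redundancy.lost
2022-07-07T09:01:08Z hostd: Event 104 : esx.clear.net.redundancy.restored
EOF

lost=$(grep -c 'esx\.problem\.net\.redundancy\.lost' "$LOG")
restored=$(grep -c 'esx\.clear\.net\.redundancy\.restored' "$LOG")
echo "lost=$lost restored=$restored"
rm -f "$LOG"
```

If the counts keep climbing in lockstep while the port-channel is stably up, you are seeing the same loop I described; in our case only restarting the management agents stopped it.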