VMware ESXi Version 5.5.0
Build 1746018
HP ProLiant DL360e Gen8
48GB RAM
2x quad-core Intel Xeon
1. VM: Server 2012 R2 as domain controller, 16 GB RAM
2. VM: Server 2012 R2 as terminal server, 16 GB RAM
Hello, since May 2017 I have had an issue with a guest VM on ESXi version 5.5.0.
The guest randomly stops working and "freezes" within more or less 24 hours.
The event log doesn't record any error.
If the incident occurs while I'm logged in via RDP, the session isn't terminated.
ipconfig can't be executed and there is no reaction from the cmd process.
Surprisingly, I can still reach the terminal server via ping and get an echo reply.
VM1 keeps running without problems.
During the freeze it's impossible to use the console in the vSphere Client, and the status of the running VMware Tools changes to "not running".
What I've done so far was to uninstall the KBs for .NET 4.x and reinstall the VMware Tools, without any effect.
ESET is installed as antivirus software on both VMs.
Does anyone have a hint on how to find the cause of this issue?
Regards
Mike
Freeze-State:
~ # ps -s | grep TS2
6135153 vmm0:TS2 WAIT IDLE 0-7
6135155 vmm1:TS2 WAIT IDLE 0-7
6135156 vmm2:TS2 WAIT IDLE 0-7
6135157 vmm3:TS2 WAIT IDLE 0-7
6135158 4912237 vmx-vthread-7:TS2 WAIT UFUTEX 0-7 /bin/vmx
6135159 4912237 vmx-mks:TS2 WAIT UPOL 0-7 /bin/vmx
6135160 4912237 vmx-svga:TS2 WAIT SEMA 0-7 /bin/vmx
6135163 4912237 vmx-vcpu-0:TS2 WAIT IDLE 0-7 /bin/vmx
6135164 4912237 vmx-vcpu-1:TS2 WAIT IDLE 0-7 /bin/vmx
6135165 4912237 vmx-vcpu-2:TS2 WAIT IDLE 0-7 /bin/vmx
6135166 4912237 vmx-vcpu-3:TS2 WAIT IDLE 0-7 /bin/vmx
Normal (running) state:
~ # ps -s | grep TS2
6513971 vmm0:TS2 WAIT IDLE 0-7
6513975 vmm1:TS2 WAIT IDLE 0-7
6513976 vmm2:TS2 RUN NONE 0-7
6513977 vmm3:TS2 WAIT IDLE 0-7
6513978 6513970 vmx-vthread-7:TS2 WAIT UFUTEX 0-7 /bin/vmx
6513979 6513970 vmx-vthread-8:TS2 WAIT UFUTEX 0-7 /bin/vmx
6513980 6513970 vmx-mks:TS2 WAIT UPOL 0-7 /bin/vmx
6513981 6513970 vmx-svga:TS2 WAIT SEMA 0-7 /bin/vmx
6513982 6513970 vmx-vcpu-0:TS2 WAIT IDLE 0-7 /bin/vmx
6513983 6513970 vmx-vcpu-1:TS2 WAIT IDLE 0-7 /bin/vmx
6513984 6513970 vmx-vcpu-2:TS2 RUN NONE 0-7 /bin/vmx
6513985 6513970 vmx-vcpu-3:TS2 WAIT IDLE 0-7 /bin/vmx
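One way to pin down what differs between two such captures is simply to diff them; in the listings above the frozen state happens to be missing the vmx-vthread-8 world. A minimal sketch, using two sample files that stand in for live output:

```shell
# Sketch: capture the world list in both states and diff the captures.
# The sample files below are cut down from the listings above; on the
# host you would redirect the real `ps -s | grep TS2` output instead.
printf '%s\n' 'vmx-vthread-7:TS2 WAIT UFUTEX' > /tmp/ts2-freeze.txt
printf '%s\n' 'vmx-vthread-7:TS2 WAIT UFUTEX' \
              'vmx-vthread-8:TS2 WAIT UFUTEX' > /tmp/ts2-normal.txt

# diff exits non-zero when the files differ, so guard it with || true
diff /tmp/ts2-freeze.txt /tmp/ts2-normal.txt || true
```

The diff immediately shows the world that exists only in the running state.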
If your ESXi host uses a Broadcom 57xx chipset NIC, you may want to look at this: https://kb.vmware.com/kb/2035701
Hi bluefirestorm, thanks for your reply.
The DL360 has an Intel I350 Gigabit Controller.
I removed the antivirus software ESET and replaced it with the built-in Windows Defender from Microsoft,
just in case the software is the reason.
Any scheduled backup around the time you face the issue?
Within VM2, the "TerminalServer", the MS Windows Server Backup starts at 4:00 pm.
For the external backup, Veeam Backup & Replication starts at 11:30 pm.
The internal backup mostly finishes without problems, except at the times when the guest OS "freezes".
It's unpredictable when the "freeze" will occur.
To rule out any underlying storage issue, check whether you are seeing any performance deterioration message in vmkernel.log:
"<storage_device> performance has deteriorated" message in ESXi (2007236) | VMware KB
Check the DAVG/cmd value for any latency issues.
All arrays perform differently; however, DAVG/cmd, KAVG/cmd, and GAVG/cmd should not exceed 10 milliseconds (ms) for sustained periods of time.
If you found this or any other answer helpful, please consider marking it Correct or Helpful to award points.
Best Regards,
Deepak Koshal
CNE|CLA|CWMA|VCP4|VCP5|CCAH
Hi dekoshal, thanks for your reply.
First of all I want to report that the "freezing" issue is still happening after that change.
Your hint to look into vmkernel.log turned up the following messages:
2017-07-13T19:04:28.004Z cpu0:32789)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2457 microseconds to 49851 microseconds.
2017-07-13T23:54:15.122Z cpu3:32792)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2463 microseconds to 56565 microseconds.
2017-07-15T19:00:12.547Z cpu2:32791)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2446 microseconds to 59375 microseconds.
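Warnings like these can be pulled out of vmkernel.log and converted from microseconds to milliseconds with a short filter. A sketch, fed with one of the sample lines above instead of the live log:

```shell
# Sketch: extract the degraded latency value (microseconds) from
# "performance has deteriorated" warnings and print it in ms.
# A sample line from the post stands in for /var/log/vmkernel.log.
line='2017-07-13T19:04:28.004Z cpu0:32789)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2457 microseconds to 49851 microseconds.'

printf '%s\n' "$line" \
  | grep 'performance has deteriorated' \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "to") print $(i+1) / 1000 " ms"}'
# → 49.851 ms
```

Roughly 50 ms is well above the 10 ms guideline mentioned earlier, so the storage side is worth pursuing.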
The backup via Veeam runs the VMs sequentially, starting with VM1 and so on.
The internal backup runs at 2:00 pm UTC.
The Veeam backup runs at 9:30 pm UTC.
Hi Mike,
If you are using an E1000/E1000E NIC, you might face this kind of NIC freeze; try changing the NIC type.
If you are using the vmxnet3 type and still get freezes, try increasing the RX buffer size in the vNIC.
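Which vNIC type a guest uses can be read straight from its .vmx file via the `ethernet0.virtualDev` key (an absent key means the default adapter). A sketch against a throwaway sample file, since the real .vmx only exists on the host's datastore:

```shell
# Sketch: check the adapter type of the first vNIC in a .vmx file.
# A temporary sample file stands in for the real TS.vmx on the datastore.
vmx=/tmp/sample.vmx
printf '%s\n' 'ethernet0.present = "TRUE"' \
              'ethernet0.virtualDev = "vmxnet3"' > "$vmx"

grep '^ethernet0.virtualDev' "$vmx" | cut -d'"' -f2
# → vmxnet3
```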
Hi Dee006, thanks for your reply.
Today I updated the host to ESXi 5.5.0 Update 3, updated the VMware Tools inside the guests, and changed the buffer size / Rx Ring #1 size to the maximum allowed level,
according to the entry you mentioned in KB 2039495. The next 24 hours will show whether the problem occurs again.
Regards, Mike
Thank you for the update.
As mentioned in the KB article, the device latency may increase due to one of several reasons.
If it's one of the first three, you may want to check the logs on the storage array side to investigate further. If it's because of an overload condition, you need to make sure that:
1. Backup jobs do not run during production hours.
2. The HBA driver on the hosts is updated.
3. The queue depth is set consistently across the board, from host to storage.
4. Virtual machines are balanced across datastores, not just by the number of VMs but also by the number of latency-sensitive VMs.
Also look into guest-OS-level logs for more clues, check what type of application is running on the VM, and based on that revisit the VM's hardware configuration if required (for more info, check out the link below, comment #3).
Best Regards,
Deepak Koshal
CNE|CLA|CWMA|VCP4|VCP5|CCAH
Hi Mptter,
Did you find anything significant from esxtop?
Best Regards,
Deepak Koshal
CNE|CLA|CWMA|VCP4|VCP5|CCAH
Hi dekoshal,
after the update to VMware ESXi 5.5.0 build 3116895, all guest OSes have been running stable for > 24 hours, but the client wants the TS guest to reboot before working hours begin.
That's the arrangement until the coming Friday. I wrote a script and run it via cron. The backups themselves run only at night, outside working hours.
In esxtop nothing really prominent was visible. DAVG/cmd is 10-12 during working hours and rises to 25 during the backup.
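For the scheduled reboot, a hedged sketch of such a cron script: look up the VM's id and then cycle it with `vim-cmd`. The getallvms output below is a made-up sample (the ids are invented) so the parsing step is visible outside the host:

```shell
# Sketch: find the id of the VM named "TS" as a cron reboot script would.
# Sample text stands in for `vim-cmd vmsvc/getallvms`; the ids are invented.
sample='Vmid  Name  File              Guest OS               Version
8     DC    [ds1] DC/DC.vmx   windows8Server64Guest   vmx-09
12    TS    [ds1] TS/TS.vmx   windows8Server64Guest   vmx-09'

vmid=$(printf '%s\n' "$sample" | awk '$2 == "TS" {print $1}')
echo "$vmid"
# → 12
# On the host the cron job would then run something like:
#   vim-cmd vmsvc/power.shutdown "$vmid"; sleep 120; vim-cmd vmsvc/power.on "$vmid"
```

A guest-side scheduled task inside Windows would do the job just as well; the host-side variant works even when the guest is hung.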
Another problem occurred and it affects the backup. The vSphere Client was also updated and now reports every day that a consolidation is necessary, for both the DC and the TS.
Yesterday it was just the DC; I let the consolidation run and it finished successfully, but today the consolidation involves both guests.
In the Snapshot Manager no snapshot is visible, yet the disks in use have changed from DC.vmdk to DC-000001.vmdk.
The folder for TS contains the following entries:
/vmfs/volumes/54295773-23b4ec4d-aec7-c4346bac8728/TS # ls -lah
total 59844640
drwxr-xr-x 1 root root 3.3K Jul 18 10:39 .
drwxr-xr-t 1 root root 1.9K Jul 12 20:20 ..
-rw------- 1 root root 6.3M Jul 18 10:39 TS-000002-ctk.vmdk
-rw------- 1 root root 16.4M Jul 18 10:41 TS-000002-delta.vmdk
-rw------- 1 root root 369 Jul 18 10:39 TS-000002.vmdk
-rw-r--r-- 1 root root 27 Jun 22 07:48 TS-4ba1ba0e.hlog
-rw------- 1 root root 16.0G Jul 18 04:10 TS-4ba1ba0e.vswp
-rw------- 1 root root 6.3M Jul 18 10:39 TS-ctk.vmdk
-rw------- 1 root root 200.0G Jul 18 10:39 TS-flat.vmdk
-rw------- 1 root root 8.5K Jul 18 10:39 TS.nvram
-rw------- 1 root root 571 Jul 18 10:39 TS.vmdk
-rw-r--r-- 1 root root 79 Jul 18 10:39 TS.vmsd
-rwxr-xr-x 1 root root 4.0K Jul 18 10:39 TS.vmx
-rw------- 1 root root 0 Jul 18 04:10 TS.vmx.lck
-rw-r--r-- 1 root root 3.2K Jul 17 20:17 TS.vmxf
-rwxr-xr-x 1 root root 4.0K Jul 18 10:39 TS.vmx~
-rw-r--r-- 1 root root 17.7M Jul 12 19:33 vmware-40.log
-rw-r--r-- 1 root root 58.0M Jul 15 19:02 vmware-41.log
-rw-r--r-- 1 root root 207.8K Jul 16 11:01 vmware-42.log
-rw-r--r-- 1 root root 259.5K Jul 17 20:17 vmware-43.log
-rw-r--r-- 1 root root 178.2K Jul 17 20:30 vmware-44.log
-rw-r--r-- 1 root root 376.2K Jul 18 03:55 vmware-45.log
-rw-r--r-- 1 root root 684.6K Jul 18 10:39 vmware.log
-rw------- 1 root root 129.0M Jul 18 04:10 vmx-TS-1268890126-1.vswp
For the DC:
/vmfs/volumes/54295773-23b4ec4d-aec7-c4346bac8728/DC # ls -lah
total 341381152
drwxr-xr-x 1 root root 3.8K Jul 18 07:36 .
drwxr-xr-t 1 root root 1.9K Jul 12 20:20 ..
-rw------- 1 root root 4.7M Jul 18 07:36 DC-000001-ctk.vmdk
-rw------- 1 root root 5.8G Jul 18 10:42 DC-000001-delta.vmdk
-rw------- 1 root root 369 Jul 18 07:36 DC-000001.vmdk
-rw-r--r-- 1 root root 27 Oct 14 2014 DC-4ba1b7ee.hlog
-rw------- 1 root root 15.0G Jul 16 13:51 DC-4ba1b7ee.vswp
-rw-r--r-- 1 root root 13 Nov 13 2016 DC-aux.xml
-rw------- 1 root root 4.7M Jul 18 07:35 DC-ctk.vmdk
-rw------- 1 root root 300.0G Jul 18 07:35 DC-flat.vmdk
-rw------- 1 root root 8.5K Jul 18 07:36 DC.nvram
-rw------- 1 root root 571 Jul 18 07:35 DC.vmdk
-rw-r--r-- 1 root root 79 Jul 18 07:35 DC.vmsd
-rwxr-xr-x 1 root root 3.5K Jul 18 07:35 DC.vmx
-rw------- 1 root root 0 Jun 15 11:21 DC.vmx.lck
-rw-r--r-- 1 root root 3.2K Jul 18 10:14 DC.vmxf
-rwxr-xr-x 1 root root 3.5K Jul 18 07:35 DC.vmx~
-rw------- 1 root root 6.3M Jul 18 07:36 DC_1-ctk.vmdk
-rw------- 1 root root 200.0G Jul 18 10:41 DC_1-flat.vmdk
-rw------- 1 root root 575 Jul 18 07:36 DC_1.vmdk
-rw-r--r-- 1 root root 91.9M Feb 9 20:11 vmware-21.log
-rw-r--r-- 1 root root 145.8M Jun 7 09:11 vmware-22.log
-rw-r--r-- 1 root root 219.0K Jun 7 09:22 vmware-23.log
-rw-r--r-- 1 root root 221.4K Jun 7 10:21 vmware-24.log
-rw-r--r-- 1 root root 841.1K Jun 15 10:26 vmware-25.log
-rw-r--r-- 1 root root 2.3M Jul 16 11:16 vmware-26.log
-rw-r--r-- 1 root root 2.1M Jul 18 10:37 vmware.log
-rw------- 1 root root 129.0M Jul 16 12:19 vmx-DC-1268889582-1.vswp
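Leftover snapshot disks like the DC-000001-delta.vmdk above can be spotted across a whole datastore with a quick find. A sketch over a temporary directory standing in for /vmfs/volumes, with file names taken from the listings:

```shell
# Sketch: list leftover snapshot delta disks under a datastore path.
# A temp directory with names from the post stands in for /vmfs/volumes.
ds=$(mktemp -d)
mkdir -p "$ds/DC" "$ds/TS"
touch "$ds/DC/DC-flat.vmdk" "$ds/DC/DC-000001-delta.vmdk" \
      "$ds/TS/TS-flat.vmdk" "$ds/TS/TS-000002-delta.vmdk"

find "$ds" -name '*-delta.vmdk' | sort
```

Any hit outside a deliberately kept snapshot is a candidate for consolidation.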
Update:
The consolidation finished successfully after I shut down the Veeam service on the 3rd guest.
The issue where you see a snapshot in the virtual machine folder but not in the Snapshot Manager occurs when VMware receives the command to remove the snapshot (manual or automatic) but cannot remove it, because of a timeout due to latency or some other reason. To let the VMware administrator know that there are leftover snapshots on the VM which need to be removed, a message is shown on the VM summary stating that virtual machine disk consolidation is required. Once disk consolidation completes, the leftover delta disk is merged into the base disk.
The issue above can arise for many reasons, such as:
1. A compatibility issue between vSphere and the backup solution. Check whether the backup solution requires any firmware update or patching.
2. All backup jobs triggering at the same time. Create multiple backup jobs with small batches of VMs and schedule them so they do not all start together.
3. An outdated ESXi HBA driver. Make sure it is updated.
Best Regards,
Deepak Koshal
CNE|CLA|CWMA|VCP4|VCP5|CCAH