VMware Cloud Community
mptter
Contributor

VMware ESXi 5.5.0 - Guest OS Windows Server 2012 R2 freezing

VMware ESXi Version 5.5.0

Build 1746018

HP ProLiant DL360e Gen8

48GB RAM

2x XEON Quadcore i7

VM1: Windows Server 2012 R2 as domain controller, 16 GB RAM

VM2: Windows Server 2012 R2 as terminal server, 16 GB RAM

Hello, since May 2017 I have had an issue with a guest VM on ESXi 5.5.0.

The guest randomly stops working and "freezes" within roughly 24 hours.

The event log doesn't record any error.

If the incident occurs while I'm logged in via RDP, the session isn't terminated.

ipconfig can't be executed, and the cmd process doesn't react.

Surprisingly, I can still ping the terminal server and get an echo reply.

VM1 keeps running without problems.

When the freeze occurs, the console in the vSphere Client can't be used and the status of VMware Tools changes to "not running".

So far I have uninstalled the KBs for .NET 4.x and reinstalled VMware Tools, without any effect.

ESET is installed as antivirus software on both VMs.

Does anyone have a hint on how to find the cause of this issue?

Regards

Mike

Freeze-State:

~ # ps -s | grep TS2

6135153      vmm0:TS2         WAIT   IDLE   0-7

6135155      vmm1:TS2         WAIT   IDLE   0-7

6135156      vmm2:TS2         WAIT   IDLE   0-7

6135157      vmm3:TS2         WAIT   IDLE   0-7

6135158 4912237 vmx-vthread-7:TS2 WAIT   UFUTEX 0-7  /bin/vmx

6135159 4912237 vmx-mks:TS2      WAIT   UPOL   0-7  /bin/vmx

6135160 4912237 vmx-svga:TS2     WAIT   SEMA   0-7  /bin/vmx

6135163 4912237 vmx-vcpu-0:TS2   WAIT   IDLE   0-7  /bin/vmx

6135164 4912237 vmx-vcpu-1:TS2   WAIT   IDLE   0-7  /bin/vmx

6135165 4912237 vmx-vcpu-2:TS2   WAIT   IDLE   0-7  /bin/vmx

6135166 4912237 vmx-vcpu-3:TS2   WAIT   IDLE   0-7  /bin/vmx

Normal:

~ # ps -s | grep TS2

6513971      vmm0:TS2         WAIT   IDLE   0-7

6513975      vmm1:TS2         WAIT   IDLE   0-7

6513976      vmm2:TS2         RUN    NONE   0-7

6513977      vmm3:TS2         WAIT   IDLE   0-7

6513978 6513970 vmx-vthread-7:TS2 WAIT   UFUTEX 0-7  /bin/vmx

6513979 6513970 vmx-vthread-8:TS2 WAIT   UFUTEX 0-7  /bin/vmx

6513980 6513970 vmx-mks:TS2      WAIT   UPOL   0-7  /bin/vmx

6513981 6513970 vmx-svga:TS2     WAIT   SEMA   0-7  /bin/vmx

6513982 6513970 vmx-vcpu-0:TS2   WAIT   IDLE   0-7  /bin/vmx

6513983 6513970 vmx-vcpu-1:TS2   WAIT   IDLE   0-7  /bin/vmx

6513984 6513970 vmx-vcpu-2:TS2   RUN    NONE   0-7  /bin/vmx

6513985 6513970 vmx-vcpu-3:TS2   WAIT   IDLE   0-7  /bin/vmx
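The two listings above can also be compared mechanically. The following is only a minimal Python sketch, assuming the `ps -s` column layout shown in this post (the world name is the token containing `:` and the state is the token right after it). The function name `summarize_worlds` is made up for illustration, and the sample strings are a subset of the lines above.

```python
from collections import Counter

def summarize_worlds(ps_output):
    """Count world states (e.g. WAIT, RUN) in `ps -s | grep <vm>` output.

    Assumption: the world name is the first token containing ':' and the
    state is the token that follows it, as in the listings in this post.
    """
    states = Counter()
    for line in ps_output.splitlines():
        tokens = line.split()
        for i, tok in enumerate(tokens[:-1]):
            if ":" in tok:
                states[tokens[i + 1]] += 1
                break  # only the first match per line is the world name
    return states

# Subset of the "Freeze-State" listing
frozen = """\
6135153      vmm0:TS2         WAIT   IDLE   0-7
6135155      vmm1:TS2         WAIT   IDLE   0-7
6135163 4912237 vmx-vcpu-0:TS2   WAIT   IDLE   0-7  /bin/vmx
6135164 4912237 vmx-vcpu-1:TS2   WAIT   IDLE   0-7  /bin/vmx
"""

# Subset of the "Normal" listing
normal = """\
6513975      vmm1:TS2         WAIT   IDLE   0-7
6513976      vmm2:TS2         RUN    NONE   0-7
6513983 6513970 vmx-vcpu-1:TS2   WAIT   IDLE   0-7  /bin/vmx
6513984 6513970 vmx-vcpu-2:TS2   RUN    NONE   0-7  /bin/vmx
"""

print("frozen:", dict(summarize_worlds(frozen)))
print("normal:", dict(summarize_worlds(normal)))
```

A single sample in which every world is in WAIT does not prove a hang, since an idle guest can look the same. It only becomes a hint when repeated samples during the freeze never show a world in RUN.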

Running-State:

12 Replies
bluefirestorm
Champion

If your ESXi host uses a Broadcom 57xx chipset NIC, you may want to look at this https://kb.vmware.com/kb/2035701

mptter
Contributor

Hi bluefirestorm, thanks for your reply.

The DL360 has an Intel I350 Gigabit Controller.

I removed the antivirus software "ESET" and replaced it with the bitdefender from Microsoft.

Just in case the software is the cause.

vijayrana968
Virtuoso

Is any scheduled backup running around the time you see the issue?

mptter
Contributor

Inside VM2 ("TerminalServer"), the MS Server Backup starts at 04:00 pm.

The external backup, "Veeam Backup & Replication", starts at 11:30 pm.

The internal backup mostly finishes without problems, except when the guest OS "freezes".

It's unpredictable when the "freeze" will occur.

dekoshal
Hot Shot

To rule out an underlying storage issue, check whether vmkernel.log contains any performance deterioration messages.

"<storage_device> performance has deteriorated" message in ESXi (2007236) | VMware KB

Check the DAVG/cmd value for any latency issues.

All arrays perform differently; however, DAVG/cmd, KAVG/cmd, and GAVG/cmd should not exceed 10 milliseconds (ms) for sustained periods of time.
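As a rough illustration of that rule of thumb, here is a minimal Python sketch. It assumes you have already copied per-interval values from esxtop into lists by hand. The sample numbers are invented, and treating "sustained" as three consecutive samples above the threshold is my own assumption, not something the KB defines.

```python
def is_sustained(values_ms, threshold_ms=10.0, min_consecutive=3):
    """Return True if at least `min_consecutive` samples in a row exceed
    `threshold_ms`. Both defaults are illustrative assumptions."""
    run = 0
    for value in values_ms:
        run = run + 1 if value > threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False

# Hypothetical samples in milliseconds, one value per esxtop refresh
samples = {
    "DAVG/cmd": [4.1, 12.5, 25.0, 24.3, 11.8, 3.9],
    "KAVG/cmd": [0.1, 0.2, 0.1, 0.3, 0.1, 0.1],
}

for counter, values in samples.items():
    print(counter, "sustained above 10 ms:", is_sustained(values))
```

A short spike above 10 ms is therefore ignored, while a longer run, such as one lasting for a whole backup window, is flagged.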

Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions) (1008205) | V...

If you found this or any other answer helpful, please consider the use of the Correct or Helpful to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

mptter
Contributor

Hi dekoshal, thanks for your reply.

First, I want to report that the "freezing" issue is still happening after the installation.

Following your hint, I looked into vmkernel.log and found these messages:

2017-07-13T19:04:28.004Z cpu0:32789)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2457 microseconds to 49851 microseconds.

2017-07-13T23:54:15.122Z cpu3:32792)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2463 microseconds to 56565 microseconds.

2017-07-15T19:00:12.547Z cpu2:32791)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2446 microseconds to 59375 microseconds.
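To put these three warnings into numbers, the following is a minimal Python sketch that extracts the timestamp, device, and both latency values and converts them to milliseconds. The regular expression is only fitted to the message format quoted above, and the function name `parse_latency_warnings` is made up for illustration.

```python
import re

# Fitted to the "performance has deteriorated" lines quoted in this post.
PATTERN = re.compile(
    r"^(?P<ts>\S+) .*Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<now>\d+) microseconds\."
)

def parse_latency_warnings(log_text):
    """Return one dict per matching warning line, with latencies in ms."""
    events = []
    for line in log_text.splitlines():
        m = PATTERN.search(line)
        if m:
            avg_us, now_us = int(m["avg"]), int(m["now"])
            events.append({
                "timestamp": m["ts"],
                "device": m["device"],
                "avg_ms": avg_us / 1000,
                "now_ms": now_us / 1000,
                "factor": round(now_us / avg_us, 1),
            })
    return events

log_text = """\
2017-07-13T19:04:28.004Z cpu0:32789)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2457 microseconds to 49851 microseconds.
2017-07-13T23:54:15.122Z cpu3:32792)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2463 microseconds to 56565 microseconds.
2017-07-15T19:00:12.547Z cpu2:32791)WARNING: ScsiDeviceIO: 1223: Device naa.600508b1001c2809078a6fc03be126ab performance has deteriorated. I/O latency increased from average value of 2446 microseconds to 59375 microseconds.
"""

for event in parse_latency_warnings(log_text):
    print(event["timestamp"], f'{event["avg_ms"]:.1f} ms ->',
          f'{event["now_ms"]:.1f} ms', f'(x{event["factor"]})')
```

For the three lines above, this works out to spikes of roughly 50 to 59 ms against an average of about 2.5 ms, that is, around 20 to 24 times the average.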

The Veeam backup processes the VMs one after another, starting with VM1.

The internal backup runs at 2:00 pm UTC.

The Veeam backup runs at 9:30 pm UTC.

Dee006
Hot Shot

Hi Mike,

If you are using an E1000/e1000e NIC, you might run into a NIC freeze; try changing the NIC type.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=21099...

If you are using the vmxnet3 type and still get freezes, try increasing the Rx buffer size of the vNIC.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=20394...

mptter
Contributor

Hi Dee006, thanks for your reply.

Today I updated the host to ESXi 5.5.0 Update 3, updated VMware Tools inside the guests, and raised the buffer size / Ring #1 size to the maximum allowed value, following KB 2039495 that you mentioned. The next 24 hours will show whether the problem occurs again.

Regards Mike

dekoshal
Hot Shot

Thank you for the update.

As mentioned in the KB article, the device latency may increase for one of these reasons:

  • Changes made on the target
  • Disk or media failures
  • Failover
  • Overload conditions on the device

If it is one of the first three, check the logs on the storage array side to investigate further. If it is an overload condition, make sure that:

  • Backup jobs do not run during production hours
  • The HBA drivers on the hosts are up to date
  • The queue depth is set to the same value across the board, from host to storage
  • Virtual machines are balanced across the datastores, not just by number of VMs but also by number of latency-sensitive VMs

Also look into the guest OS logs for more clues, check what type of application is running on the VM, and based on that revisit the VM's hardware configuration if required (for more info, see Comment #3 at the link below).

ESXi datastore configuration

If you found this or any other answer helpful, please consider the use of the Correct or Helpful to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

dekoshal
Hot Shot

Hi Mptter,

Did you find anything significant from esxtop?

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

mptter
Contributor

Hi dekoshal,

After the update to VMware ESXi 5.5.0 build-3116895, all guest OSes have been running stable for more than 24 hours, but the client wants the TS guest to be rebooted before working hours begin.

That order only applies until the coming Friday. I wrote a script and let it run via cron. The backups themselves run only outside working hours, at night.

Nothing really stood out in esxtop. DAVG/cmd is 10-12 ms during working hours and rises to 25 ms during the backup.

Another problem has occurred, and it affects the backup. The vSphere Client was also updated and now reports every day that a consolidation is necessary for the DC and the TS.

Yesterday it was just the DC; I let the consolidation run and it finished successfully, but today both guests need consolidation.

No snapshot is visible in the Snapshot Manager, and the disk in use has changed from DC.vmdk to DC-000001.vmdk.

The folder for TS contains the following entries:

/vmfs/volumes/54295773-23b4ec4d-aec7-c4346bac8728/TS # ls -lah

total 59844640

drwxr-xr-x    1 root     root        3.3K Jul 18 10:39 .

drwxr-xr-t    1 root     root        1.9K Jul 12 20:20 ..

-rw-------    1 root     root        6.3M Jul 18 10:39 TS-000002-ctk.vmdk

-rw-------    1 root     root       16.4M Jul 18 10:41 TS-000002-delta.vmdk

-rw-------    1 root     root         369 Jul 18 10:39 TS-000002.vmdk

-rw-r--r--    1 root     root          27 Jun 22 07:48 TS-4ba1ba0e.hlog

-rw-------    1 root     root       16.0G Jul 18 04:10 TS-4ba1ba0e.vswp

-rw-------    1 root     root        6.3M Jul 18 10:39 TS-ctk.vmdk

-rw-------    1 root     root      200.0G Jul 18 10:39 TS-flat.vmdk

-rw-------    1 root     root        8.5K Jul 18 10:39 TS.nvram

-rw-------    1 root     root         571 Jul 18 10:39 TS.vmdk

-rw-r--r--    1 root     root          79 Jul 18 10:39 TS.vmsd

-rwxr-xr-x    1 root     root        4.0K Jul 18 10:39 TS.vmx

-rw-------    1 root     root           0 Jul 18 04:10 TS.vmx.lck

-rw-r--r--    1 root     root        3.2K Jul 17 20:17 TS.vmxf

-rwxr-xr-x    1 root     root        4.0K Jul 18 10:39 TS.vmx~

-rw-r--r--    1 root     root       17.7M Jul 12 19:33 vmware-40.log

-rw-r--r--    1 root     root       58.0M Jul 15 19:02 vmware-41.log

-rw-r--r--    1 root     root      207.8K Jul 16 11:01 vmware-42.log

-rw-r--r--    1 root     root      259.5K Jul 17 20:17 vmware-43.log

-rw-r--r--    1 root     root      178.2K Jul 17 20:30 vmware-44.log

-rw-r--r--    1 root     root      376.2K Jul 18 03:55 vmware-45.log

-rw-r--r--    1 root     root      684.6K Jul 18 10:39 vmware.log

-rw-------    1 root     root      129.0M Jul 18 04:10 vmx-TS-1268890126-1.vswp

For the DC:

/vmfs/volumes/54295773-23b4ec4d-aec7-c4346bac8728/DC # ls -lah

total 341381152

drwxr-xr-x    1 root     root        3.8K Jul 18 07:36 .

drwxr-xr-t    1 root     root        1.9K Jul 12 20:20 ..

-rw-------    1 root     root        4.7M Jul 18 07:36 DC-000001-ctk.vmdk

-rw-------    1 root     root        5.8G Jul 18 10:42 DC-000001-delta.vmdk

-rw-------    1 root     root         369 Jul 18 07:36 DC-000001.vmdk

-rw-r--r--    1 root     root          27 Oct 14  2014 DC-4ba1b7ee.hlog

-rw-------    1 root     root       15.0G Jul 16 13:51 DC-4ba1b7ee.vswp

-rw-r--r--    1 root     root          13 Nov 13  2016 DC-aux.xml

-rw-------    1 root     root        4.7M Jul 18 07:35 DC-ctk.vmdk

-rw-------    1 root     root      300.0G Jul 18 07:35 DC-flat.vmdk

-rw-------    1 root     root        8.5K Jul 18 07:36 DC.nvram

-rw-------    1 root     root         571 Jul 18 07:35 DC.vmdk

-rw-r--r--    1 root     root          79 Jul 18 07:35 DC.vmsd

-rwxr-xr-x    1 root     root        3.5K Jul 18 07:35 DC.vmx

-rw-------    1 root     root           0 Jun 15 11:21 DC.vmx.lck

-rw-r--r--    1 root     root        3.2K Jul 18 10:14 DC.vmxf

-rwxr-xr-x    1 root     root        3.5K Jul 18 07:35 DC.vmx~

-rw-------    1 root     root        6.3M Jul 18 07:36 DC_1-ctk.vmdk

-rw-------    1 root     root      200.0G Jul 18 10:41 DC_1-flat.vmdk

-rw-------    1 root     root         575 Jul 18 07:36 DC_1.vmdk

-rw-r--r--    1 root     root       91.9M Feb  9 20:11 vmware-21.log

-rw-r--r--    1 root     root      145.8M Jun  7 09:11 vmware-22.log

-rw-r--r--    1 root     root      219.0K Jun  7 09:22 vmware-23.log

-rw-r--r--    1 root     root      221.4K Jun  7 10:21 vmware-24.log

-rw-r--r--    1 root     root      841.1K Jun 15 10:26 vmware-25.log

-rw-r--r--    1 root     root        2.3M Jul 16 11:16 vmware-26.log

-rw-r--r--    1 root     root        2.1M Jul 18 10:37 vmware.log

-rw-------    1 root     root      129.0M Jul 16 12:19 vmx-DC-1268889582-1.vswp
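Both listings show a delta extent next to the base disk even though the Snapshot Manager shows no snapshot. As a minimal Python sketch, leftover delta disks can be picked out of a list of file names like this. The naming pattern (`<disk>-<six digits>-delta.vmdk`) is inferred only from the listings in this post, and the function name `find_delta_disks` is made up for illustration.

```python
import re

# Pattern inferred from the listings above, e.g. "TS-000002-delta.vmdk".
DELTA_RE = re.compile(r"^(?P<base>.+)-(?P<seq>\d{6})-delta\.vmdk$")

def find_delta_disks(filenames):
    """Return (base disk name, sequence number) for each delta extent."""
    found = []
    for name in filenames:
        m = DELTA_RE.match(name)
        if m:
            found.append((m["base"], int(m["seq"])))
    return found

# The .vmdk names from the two `ls -lah` listings above
ts_files = [
    "TS-000002-ctk.vmdk", "TS-000002-delta.vmdk", "TS-000002.vmdk",
    "TS-ctk.vmdk", "TS-flat.vmdk", "TS.vmdk",
]
dc_files = [
    "DC-000001-ctk.vmdk", "DC-000001-delta.vmdk", "DC-000001.vmdk",
    "DC-ctk.vmdk", "DC-flat.vmdk", "DC.vmdk",
    "DC_1-ctk.vmdk", "DC_1-flat.vmdk", "DC_1.vmdk",
]

print("TS:", find_delta_disks(ts_files))
print("DC:", find_delta_disks(dc_files))
```

An empty result for a VM folder would suggest that no delta extent with this naming is left behind. It says nothing about whether the VM's configuration still points at a snapshot disk.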

Update:

The consolidation finished successfully after I shut down the Veeam service on the 3rd guest.

dekoshal
Hot Shot

A snapshot that exists in the virtual machine folder but is not shown in the Snapshot Manager occurs when VMware receives the command to remove the snapshot (manually or automatically) but cannot remove it because of a timeout due to latency or some other reason. To let the VMware administrator know that a leftover snapshot on the VM needs to be removed, the VM summary shows a message stating that virtual machine disk consolidation is required. Once disk consolidation completes, the leftover delta disk is merged into the base disk.

This issue can arise for many reasons. Things to check:

1. Compatibility between vSphere and the backup solution. Check whether the backup solution requires any firmware update or patching.

2. Create multiple backup jobs, each with a small batch of VMs, and schedule them so that they do not all trigger at the same time.

3. Make sure the ESXi HBA driver is up to date.

If you found this or any other answer helpful, please consider the use of the Correct or Helpful to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

https://in.linkedin.com/in/dkoshal
