Mikky83
Contributor

VM freezes for 4-5 seconds and disconnects all users

Hi,

I have read all the similar problems on the communities and have not found a solution. I am attaching the .vmx file and the log file.

12 Replies
idle-jam
Immortal

How frequently does it happen? And is this the first time? Please elaborate so that we can assist you further.

ChrisDearden
Expert

What sort of VM is it? Have you got something taking snapshots of it?

If this post has been useful , please consider awarding points. @chrisdearden http://jfvi.co.uk http://vsoup.net
Mikky83
Contributor

Hello,

This is the first time this problem has occurred. The VM is a file server, and clients open documents through network shares on it. Ping is OK (there are no network problems). After some time the server in the VM (Windows 2003 R2) freezes and the clients lose the server for 5-6 seconds. Then they continue to work.

There are no errors in Event Viewer. I found a similar problem on the communities with these log entries:

vmx| GuestRpcSendTimedOut: message to toolbox timed out. and

vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down

I uninstalled VMware Tools but the problem remains. The freezes are more frequent when more clients work on the file server (VM). The storage is an IBM DS5100, ESX is 4.1. There are no snapshots on the VM.

Is it possible that the problem is with I/O on the storage?

What is your experience with this type of log error?
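To see whether the two log messages quoted above cluster around the freezes, they can be counted with a short script. This is only a minimal sketch: the two message fragments are taken from this post, while the script name, the `vmware.log` file name, and the labels `send_timeout` / `ping_timeout` are assumptions made for illustration.

```python
import re
import sys
from collections import Counter

# The two message fragments quoted in the post; the labels are made up here.
PATTERNS = {
    "send_timeout": re.compile(r"GuestRpcSendTimedOut: message to toolbox timed out"),
    "ping_timeout": re.compile(r"GuestRpc: app toolbox's second ping timeout"),
}

def count_guestrpc_timeouts(lines):
    """Return a Counter of how often each quoted message appears in `lines`."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_timeouts.py vmware.log
    with open(sys.argv[1], errors="replace") as fh:
        print(dict(count_guestrpc_timeouts(fh)))
```

A count that rises together with the number of freezes would suggest the messages are a symptom of the same stall rather than an independent VMware Tools fault.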

ChrisDearden
Expert

OK, it's a file server. I wonder if it's a pathing issue to your back-end storage?

Mikky83
Contributor

No, it's not a SAN issue. This VM was working fine without problems 2-3 days ago. The SAN was set up one year ago, by the book. It's not a SAN issue.

The only things that have changed or gone wrong in these 2-3 days are:

1. Windows updates and antivirus updates.

2. Failed battery cache pack on IBM Storage DS5020 on the DR site (there is remote mirroring with the DS5100, but a link to this issue seems illogical). I can't point to any logical dependency here. (Maybe some bug?)

I want to be sure what kind of problem this is. Is it an I/O problem (storage), or is it some VMware problem (VMware Tools, ...)?

ChrisDearden
Expert

If your storage is set for synchronous replication, writes are not committed until they have been written at your secondary site.

If the performance of your secondary array has degraded, that could in turn affect the performance of the primary storage.
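The reasoning above can be put into numbers with a toy model. This is a rough sketch under stated assumptions, not a description of how any IBM array works: the function name, the parameters, and all latency figures are invented for illustration.

```python
def write_ack_latency_ms(local_ms, remote_ms, link_rtt_ms, synchronous):
    """Latency the host sees before a write is acknowledged (toy model)."""
    if synchronous:
        # Acknowledged only after the secondary site has committed the write.
        return local_ms + link_rtt_ms + remote_ms
    # Asynchronous: acknowledged after the local commit; the copy is sent later.
    return local_ms

# Made-up figures: a healthy secondary versus one that has slowed down.
healthy = write_ack_latency_ms(1.0, 1.0, 2.0, synchronous=True)
degraded = write_ack_latency_ms(1.0, 50.0, 2.0, synchronous=True)
async_case = write_ack_latency_ms(1.0, 50.0, 2.0, synchronous=False)
print(healthy, degraded, async_case)
```

In this idealized model only a synchronous mirror passes a slow secondary on to the primary; a real array may not match the idealization.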

Oli_L
Enthusiast

Failed battery cache pack on IBM Storage DS5020

Random guess, but is cache mirroring enabled on your controllers? Maybe it can't mirror the cache anymore and the I/O request fails.

Oli

Virtuadude
Enthusiast

Check the vmkwarning logs. They should give you a clue as to whether the problem is caused by the storage.

VMs may freeze if the connectivity between the storage and the ESX host has issues, or if path thrashing is going on.

If possible try to migrate the VM to local storage and check if that resolves the issue.
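To compare the two locations with something more objective than user complaints, a small write probe can be run inside the guest before and after the migration. This is a minimal sketch with assumed names and values: `timed_write`, `find_stalls`, `probe`, the 4 KB payload, the one-second threshold, and the example path are not taken from any VMware tool.

```python
import os
import time

def timed_write(path, payload=b"x" * 4096):
    """Write and fsync a small file; return the elapsed time in seconds."""
    start = time.monotonic()
    with open(path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # force the write down to the disk
    return time.monotonic() - start

def find_stalls(latencies, threshold_s=1.0):
    """Return (index, latency) pairs for samples slower than the threshold."""
    return [(i, lat) for i, lat in enumerate(latencies) if lat > threshold_s]

def probe(path, samples=60, interval_s=1.0):
    """Take `samples` timed writes, pausing `interval_s` between them."""
    latencies = []
    for _ in range(samples):
        latencies.append(timed_write(path))
        time.sleep(interval_s)
    return latencies
```

For example, `find_stalls(probe(r"D:\probe.tmp"))` (hypothetical path) would list every write that took longer than one second. If stalls of several seconds show up on the SAN-backed disk but not on local storage, that points towards the storage path.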

Oli_L
Enthusiast

If possible try to migrate the VM to local storage and check if that resolves the issue.

This is a great suggestion! Especially if you can hot clone...

Oli

http://ninefold.com

Mikky83
Contributor

The storage uses asynchronous remote mirroring. I don't know how this could affect the primary LUNs.

OK, thanks for the suggestions. I have already contacted VMware support about this issue, and I'll post feedback when the problem is solved.

indyodie
Enthusiast

Are you managing the host via VirtualCenter? Could there be an issue with the number of connections, or with the VC services being restarted?

PHD Virtual Technologies
Mikky83
Contributor

Hi to all,

I want to share feedback on this problem with you all. The problem was definitely an I/O problem (not VMware Tools, not a hardware problem, ...).

As I already mentioned, the storage system at the remote site had a failed battery (there was asynchronous replication with the primary storage). That the replication could affect the primary site this way seems absolutely illogical, but when the cache battery was replaced the problem went away. It is definitely some IBM storage async replication bug :)
