VMware Cloud Community
arent_t
Hot Shot

VM freezes .. network still replies !!

Situation:

  • one VM up and running on an ESX 3.5 server

  • I unplug both fibre cables so there is no SAN connectivity anymore

Now the VM freezes, which is okay, but what's strange is that it still replies to pings!

Any ideas ?

Thanks

jonb157
Enthusiast

Well, I assume your VMs are located on the SAN whose HBA fibre cables you are unplugging. The VM's data is encapsulated in its .vmdk file, so you are basically halting all read/write operations when you do that. The ESX host still has full network connectivity, since storage and networking are independent of each other unless you are using iSCSI, etc.

arent_t
Hot Shot

Indeed the VM is located on the SAN, but how come it still replies to ping while the vmkernel puts the VM to sleep?!

For example, take MSCS, where you have two nodes on two different ESX servers and one ESX host loses its connection to the SAN. The active cluster node is effectively no longer up and running, but meanwhile the heartbeat connection keeps responding, so the second node will not take ownership of the resources ... NO GOOD !!

Another funny thing is that when I plug the fibre cables back in after, e.g., 15 minutes, the vmkernel sees the connections again and the VM resumes running like nothing ever happened.

I would nearly call it a feature :D .. but even then the network should at least not reply ...

Have to check tomorrow, but I guess in this situation the VMM for this VM should stop running, which apparently is not the case; otherwise how could the network still be operational for this VM? :|
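
To make the heartbeat concern concrete, here is a rough Python sketch of a health check that does not trust the network alone: it also requires a small write plus fsync to finish within a deadline. This is only an illustration; it is not how MSCS or ESX decide node health, and the names `disk_is_responsive` and `node_is_healthy` are made up for this example.

```python
import os
import tempfile
import threading


def disk_is_responsive(directory: str, timeout_s: float = 5.0) -> bool:
    """Return True if a small write + fsync in `directory` completes in time."""
    result = {"ok": False}

    def probe() -> None:
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, b"heartbeat")
                os.fsync(fd)  # push the write down to the storage layer
            finally:
                os.close(fd)
                os.remove(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Run the probe in a daemon thread: if storage hangs, the thread stays
    # blocked, but the caller still gets an answer after timeout_s.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result["ok"] and not worker.is_alive()


def node_is_healthy(network_ok: bool, directory: str, timeout_s: float = 5.0) -> bool:
    """Hypothetical combined check: a ping reply alone is not enough."""
    return network_ok and disk_is_responsive(directory, timeout_s)
```

With a check like this, a node that still answers pings but whose storage has gone away would be reported as unhealthy once the probe times out.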

LarsLiljeroth
Expert

Ping only needs the VM's network side (its MAC/IP) to respond, and that part is still up and running. As mentioned, only disk I/O is halted.

If you leave the cables out much longer you could see strange errors. I have seen VC have problems finding the .vmx files because of dead paths to the LUN.

It is not a feature ;) You can see the same on a physical server: if the disk subsystem fails, the OS can still respond to ping.

// Lars Liljeroth

jakganesh
Hot Shot

Even though your VMs are on the SAN, you will still get ping replies.

The core network services keep running on the server. When you disconnect the fibre, only the VM images on the SAN are disconnected from the server.

Those images are brought to life using the server's own resources, so even after you disconnect the fibre, the running files and services belonging to that image are still alive on the server.

So the VM appears frozen on the console, but the services that were already running stay alive, which is why you still get a reply to your ping.

Jak

arent_t
Hot Shot

If the system disk stops responding on a physical server, you get a bluescreen after the disk timeout value expires ... so why don't I get this?!

How can you run services on a server without disks? And again, the vmkernel is putting the VM to bed, meaning the VM cannot do anything.

I'm not 100% sure, but I am fairly positive I have not seen this behavior with earlier versions of ESX. I will do some further testing and get back.

jhanekom
Virtuoso

VMware Tools sets the disk timeout to 60 seconds... I've not tried what you're doing, though I have had a SAN reboot on me (unplanned) without the VMs failing :)

Would that 60 second period explain what you're seeing, or not?
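
If you want to check what the guest is actually using, a minimal Python sketch like the one below reads the value from inside a Windows guest. It assumes the timeout lives in the usual Windows disk class key (`HKLM\SYSTEM\CurrentControlSet\Services\Disk`, value `TimeOutValue`); the function name `read_disk_timeout` is just for this example, and on a non-Windows machine it simply returns `None`.

```python
import sys
from typing import Optional

# Assumed location of the Windows disk class timeout (in seconds).
DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
TIMEOUT_VALUE = "TimeOutValue"


def read_disk_timeout() -> Optional[int]:
    """Return the guest's disk timeout in seconds, or None if unavailable."""
    if sys.platform != "win32":
        return None  # no registry outside Windows guests
    import winreg  # only importable on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, TIMEOUT_VALUE)
            return int(value)
    except OSError:
        return None  # key or value not present


if __name__ == "__main__":
    print("Disk timeout (s):", read_disk_timeout())
```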

arent_t
Hot Shot

Well, that was exactly my initial thought as well, but in that case it should bluescreen after 60 seconds, which it doesn't.

I did some further testing on ESX 2.5.5, just to see whether it behaves differently from ESX 3.5, but it's exactly the same.

Also, while doing this, I saw that the VMM stayed running for this VM, which probably explains the network replies I am receiving.

Anyway, I do not think this behavior is normal.

jakganesh
Hot Shot

It is because of this kind of behaviour that VMware is going to release offline virtual machines: the main image will be stored on the SAN and an offline image will be loaded onto your machine. Whatever work you do is saved to that offline image, and once you are done, the original image on the SAN gets updated.

Jak

Funtoosh
Enthusiast

Is this always true? How long did you wait before you plugged the fibre cable back in? We also had an event where the cable was pulled out accidentally, and we were only notified after an hour that VMs were down. We then triggered an HA event to move those VMs to a different host. We use OVO to monitor those VMs. This happened in a live environment. If you can share your VM's logs, it would be helpful.

arent_t
Hot Shot

It seems so ... I waited for about 15 minutes, and once I plugged them back in the VM continued as normal. I know you can probably overcome the issue by making the VM part of an HA cluster and enabling VM monitoring, so that the machine is powered off when no heartbeat is received from VMware Tools.

Anyway, I hope somebody can give me an explanation for why the VM doesn't crash. I don't understand it; the vmkernel doesn't do any caching whatsoever, so ...

I have asked some VMware guys; they couldn't give me an answer straight away.

patrickds
Expert

Even a physical server will survive removal and replacing of its disk for quite some time if there's no i/o going on, since most vital processes are in memory and don't require disks anyway, once they're started.

The time it takes to get errors and crashes, depends on whether the running processes need access to the disk frequently.

And the 60 seconds is the timeout before you get disk failure notifications; it does not mean your machine will crash.

A clean Windows 2003 server will keep running for a long time without a disk.

I guess eventually it will fail, even if it's a VM, because there are always some services that want to do some logging and such to disk.

Just try starting programs/services and doing some real work on the machines, and they'll fail sooner.

jakganesh
Hot Shot

Hi,

Even on a normal system, the OS is installed on the disk, and when you switch the system on, programs and applications are first loaded from the HDD into RAM. The kernel, and whatever output you see on the monitor, also comes from RAM. When you open a new application, the request goes to the kernel, which is already in RAM; the kernel looks the application up in the registry and loads it from the HDD into RAM. Only once the application has been loaded into RAM can you use it. So the HDD is a secondary memory device for the processor, and RAM is the primary memory that holds the running processes.

In the same way, VMware ESX reads the image from the SAN, but the SAN does not provide the hardware resources to run that image; ESX runs it on the locally available hardware. So by the time you remove the SAN connection, ESX has already loaded the modules required to run that VM into local RAM. Anything that needs the VM image then fails because the connection to the image has been lost, so the running VM freezes. It can still answer your ping because the modules required for network connectivity are already in RAM, and the IP and MAC addresses are set on the virtual network adapter, which uses resources from the local ESX server. Hence you still get your ping reply.

Jak

arent_t
Hot Shot

A reply to the last two comments:

Okay, so if that is true, then explain the following to me: using the SDK, I reproduced the scenario by removing the system disk from a running virtual machine. I did it on an XP and a 2003 virtual machine. Almost instantly I got a blue screen and the virtual machine rebooted. This is what I would like and expect to see. Removing the system disk, either by removing it from the VM or by disconnecting the fibres, is in essence the same thing from a Windows point of view: you are taking the disk away.

It still sounds odd to me !

patrickds
Expert

Removing the disk programmatically while the VM is running is probably interfering with the VM in such a way that it causes the bluescreen.

That is not actually the same as removing the disk physically.

I'm not just inventing things; I have actually removed disks from running systems and reinserted them without any errors, and not disks from a RAID set, but plain single SCSI disks on a test machine.

Basically the same thing happens when you get a storage failover on a SAN.

A path fails; this gets noticed by the ESX host (or any OS with a multipath driver installed), which initiates a scan of the backup paths and starts failing over the LUNs to another available path.

The whole process, from one path failing until complete failover, can easily take 30 seconds or more.

A system bluescreening immediately when a disk is removed is abnormal, not the opposite.
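
The failover sequence described above can be pictured with a toy Python model. This is only a sketch of the general idea, not how the ESX multipath code is implemented; `Path`, `Lun`, `submit_io` and the `send` callback are names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Path:
    name: str
    alive: bool = True


@dataclass
class Lun:
    paths: List[Path]
    active: int = 0  # index of the path currently used for I/O

    def submit_io(self, send: Callable[[Path], bool]) -> Optional[str]:
        """Try the active path first, then fail over to the remaining paths.

        Returns the name of the path that served the I/O, or None when every
        path is dead (the I/O stays pending and the guest just waits).
        """
        count = len(self.paths)
        for offset in range(count):
            index = (self.active + offset) % count
            path = self.paths[index]
            if not path.alive:
                continue  # already marked dead, skip it
            if send(path):
                self.active = index  # remember the working path
                return path.name
            path.alive = False  # mark the failed path dead, try the next one
        return None
```

Pulling one cable corresponds to the first branch (I/O continues on the surviving path after a delay); pulling both cables corresponds to the `None` case, where nothing fails outright but nothing completes either.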

jhanekom
Virtuoso

You may be onto something there, patrick. arent_t: can you try doing a rescan on the ESX host after unplugging the fibres to see if that causes the VM to fail?

arent_t
Hot Shot

Tried that. Although the LUNs disappear, the virtual machine keeps responding to network pings.

Another thing I have tried is the new HA feature that monitors an individual VM based on its heartbeats. This works brilliantly: you have the option either to leave the VM running or to power it off. With the power-off setting you have a perfect workaround for this issue.

But still ..
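
For anyone wondering what the heartbeat-based monitoring boils down to, here is a small Python sketch of the decision logic as I understand it from the two options mentioned above. It is an illustration only, not the actual HA implementation; `VmMonitor`, `failure_interval_s` and `power_off_on_failure` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class VmMonitor:
    failure_interval_s: float    # how long without heartbeats counts as failed
    power_off_on_failure: bool   # the "power off" vs "leave running" choice
    last_heartbeat: float = 0.0  # timestamp of the last heartbeat seen

    def record_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat from the guest tools arrives."""
        self.last_heartbeat = now

    def action(self, now: float) -> str:
        """Decide what to do at time `now` (seconds, same clock as heartbeats)."""
        if now - self.last_heartbeat <= self.failure_interval_s:
            return "none"  # heartbeats are still recent enough
        return "power_off" if self.power_off_on_failure else "leave_running"
```

The point of the power-off setting is that a powered-off VM stops answering pings, so anything relying on the network heartbeat finally notices the failure.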

arent_t
Hot Shot

Patrick, when you say you removed disks from a running physical server, are you talking about the system disk or data disks?

patrickds
Expert

system disk.

There was just the one disk in it.
