VMware Cloud Community
Greg_Scott
Contributor
Contributor

Trying to access a virtual machine or even open a console window hangs and times out after several minutes

All I wanted to do was set up a test VM and practice some Win2003 config options. The environment includes two ESX 3.5 servers and VIC 2.5. The ESX servers are clustered with both DRS and HA. The hosts are named ESX1 and ESX2. The VM is named NSMTest. DRS was set to fully automated - I just now changed it to manual.

NSMTest is a Win2003 server. I cloned NSMTest from another VM, got rid of its 2nd hard drive I didn't need, and added a 2nd NIC. (The reason for the clone is testing the Microsoft iSCSI driver - that's why a 2nd NIC. All I want to do is clone a Win2003 machine, familiarize myself with the Microsoft iSCSI driver, and then get rid of the clone. How tough can it be - right?)

I powered up and booted NSMTest with both virtual NICs disconnected - then put the WIndows server into its own little workgroup named GREG.

The problems started when I connected the virtual NICs. I think DRS migrated the VM from ESX1 to ESX2. Meanwhile, I must have connected the NICs to virtual switches in the wrong ESX host. Since that time, anything I try to do with the NSMTest VM hangs and times out after several minutes. Opening a console window hangs, powering off hangs, connecting/disconnecting NICs hangs - you name it, everything hangs. I can't power off the VM and I can't access it in any way.

Following a suggestion in an earlier thread - http://communities.vmware.com/thread/80413 - I connected to the service console of ESX2 and tried this:

vmware-cmd /vmfs/volumes/NorthlandVSAdatastore/NSMTest/NSMTest.vms getstate

That also hung until I aborted it with CTRL/C.

I found this process running and killed it:

root@esx2 NSMTest# ps ax | grep NSMTest

30687 ? S< 0:01 /usr/lib/vmware/bin/vmkload_app /usr/lib/vmware/bin/vmware-vmx -ssched.group=host/user -# name=VMware ESX Server;version=3.5.0;licensename=VMware ESX Server;licenseversion=2.0 build-64607; -@ pipe=/tmp/vmhsdaemon-0/vmx14a34e7c94d2263f; /vmfs/volumes/47fa8ec4-a0d21c20-c4e8-0015600e290a/NSMTest

6771 pts/0 D 0:00 grep NSMTest

root@esx2 NSMTest# kill -9 30687

..but no joy. None of this worked. Now I have a VM stuck in some unkown state with no way I know of to look at it. Does anyone have any uggestions?

thanks

  • Greg Scott

0 Kudos
3 Replies
Greg_Scott
Contributor
Contributor

I dunno - I always find the oddball problems that everyone claims never happen....

After struggling for several hours yesterday and today with the problem, I decided the most expedient thing would be to just reboot the VMWare host ESX2. Nothing is in full production yet. So I logged in via putty and rebooted it. After waiting for roughly an hour, I looked at its console and I see the entire ESX host is hung on the way down.

The very last line says this:

Unmounting file systems:

So - cloning one VM and making a mistake someplace is now forcing me to cycle the power on an entire ESX host - what's up with all those reliability claims I keep reading about ???

(Yes, I'm venting a little bit)

- Greg Scott

0 Kudos
Greg_Scott
Contributor
Contributor

It looks like connecting a VM NIC to a vSwitch port on the wrong host causes the whole host to go unstable.

I went thru yesterday's /var/log/vmkernel from host ESX2. I extracted the relevant portions and I am attaching it in case anyone wants to look.

The sequence of events looks like this:

June 2 15:03 - I reconnected my VMWare iSCSI driver to the LeftHand Networks VSA after an earlier power failure. The LeftHand Networks VSA is the shared storage where all of my application VMs will live. The whole discovery process took a couple minutes.

June 2 15:05 - VMs are starting to automatically powerup like mad while DRS is evidently migrating them all over the place.

Here is a key sequence of events between June 2 15:08 and 16:53 - pasted in below. (I am also attaching the log file extract I made.) The VM named Exchange_Server evidently has a problem with its VMDK file. But I didn't know this at the time. I cloned it and made a new VM named NSMtest. Look at the very last event pasted in below at 16:53:51. This was when I tried to hook up one of the virtual NICs on the nsmtest VM. But it tried to connect the virtual NIC to the vSwitch on the other ESX server. Things went downhill from there.

The next day, June 3, there were a couple of event messages between 10:37 and 10:40 AM. I tried rebooting the whole host at 11:18 and there are a bunch of messages when the host hung. We cycled the power at 12:15 and that's when it came back up.

Jun 2 15:08:49 esx2 vmkernel: 21:06:54:25.225 cpu2:1072)MigrateNet: vm 1072: 854: Accepted connection from <192.168.9.15>

Jun 2 15:10:01 esx2 vmkernel: 21:06:55:36.798 cpu2:1095)Migrate: 7245: 1212438357078125: Stopping pre-copy: only 2550

pages were modified

Jun 2 15:10:22 esx2 vmkernel: 21:06:55:57.611 cpu1:1092)VSCSI: 4059: Creating Virtual Device for world 1094 vscsi0:0

(handle 8199)

Jun 2 15:10:22 esx2 vmkernel: 21:06:55:57.611 cpu1:1092)VSCSI: 4059: Creating Virtual Device for world 1094 vscsi0:1

(handle 8200)

Jun 2 15:14:27 esx2 vmkernel: 21:07:00:03.473 cpu0:1092)DevFS: 2208: Unable to find device:

21028-ExchangeServer_1-000001-delta.vmdk

Jun 2 15:14:27 esx2 vmkernel: 21:07:00:03.485 cpu0:1092)DevFS: 2208: Unable to find device:

102b-ExchangeServer-000001-delta.vmdk

Jun 2 15:16:06 esx2 vmkernel: 21:07:01:42.269 cpu2:1037)DevFS: 2208: Unable to find device:

2025-ExchangeServer-000001-delta.vmdk

Jun 2 15:16:06 esx2 vmkernel: 21:07:01:42.271 cpu0:1024)VMNIX: VMKFS: 1129: ExchangeServer-000001-delta.vmdk (8229),

status -9

Jun 2 15:16:06 esx2 vmkernel: 21:07:01:42.271 cpu0:1024)VMNIX: VMKFS: 1129: ExchangeServer-flat.vmdk (8228), status -9

Jun 2 15:16:07 esx2 vmkernel: 21:07:01:43.107 cpu3:1036)DevFS: 2208: Unable to find device:

9027-ExchangeServer_1-000001-delta.vmdk

Jun 2 15:16:07 esx2 vmkernel: 21:07:01:43.109 cpu0:1024)VMNIX: VMKFS: 1129: ExchangeServer_1-000001-delta.vmdk (36903),

status -9

Jun 2 15:16:07 esx2 vmkernel: 21:07:01:43.109 cpu0:1024)VMNIX: VMKFS: 1129: ExchangeServer_1-flat.vmdk (94246), status -9

Jun 2 16:03:01 esx2 vmkernel: 21:07:48:36.999 cpu3:1037)World: vm 1106: 895: Starting world vmware-vmx with flags 4

Jun 2 16:03:03 esx2 vmkernel: 21:07:48:39.003 cpu3:1106)World: vm 1107: 895: Starting world vmm0:NSMTest with flags 8

Jun 2 16:03:03 esx2 vmkernel: 21:07:48:39.003 cpu3:1106)Sched: vm 1107: 5333: adding 'vmm0:NSMTest': group 'host/user':

cpu: shares=-3 min=0 minLimit=-1 max=-1

Jun 2 16:03:03 esx2 vmkernel: 21:07:48:39.003 cpu3:1106)Sched: vm 1107: 5352: renamed group 25 to vm.1106

Jun 2 16:03:03 esx2 vmkernel: 21:07:48:39.003 cpu3:1106)Sched: vm 1107: 5366: moved group 25 to be under group 4

Jun 2 16:03:03 esx2 vmkernel: 21:07:48:39.203 cpu3:1106)Swap: vm 1107: 2169: extending swap to 4194304 KB

Jun 2 16:03:04 esx2 vmkernel: 21:07:48:40.170 cpu3:1106)VSCSI: 4059: Creating Virtual Device for world 1107 vscsi0:0

(handle 8201)

Jun 2 16:03:04 esx2 vmkernel: 21:07:48:40.248 cpu3:1106)World: vm 1108: 895: Starting world vmware-vmx with flags 44

Jun 2 16:03:04 esx2 vmkernel: 21:07:48:40.249 cpu1:1108)World: vm 1109: 895: Starting world vmware-vmx with flags 44

Jun 2 16:03:04 esx2 vmkernel: 21:07:48:40.249 cpu1:1108)World: vm 1110: 895: Starting world vmware-vmx with flags 44

Jun 2 16:03:04 esx2 vmkernel: 21:07:48:40.250 cpu3:1107)Init: 1054: Received INIT from world 1107

Jun 2 16:53:51 esx2 vmkernel: 21:08:39:27.432 cpu3:1107)Net: 4203: unicastAddr 00:50:56:80:07:9e;

Jun 2 16:53:51 esx2 vmkernel: 21:08:39:27.435 cpu3:1107)Net: 1027: can't connect device: ESX1 Virtual Machines on vSwitch1: Not found

- Greg Scott

0 Kudos
Samcer
Expert
Expert

Hi Greg Scott,

I have a different background which leads to the same scenario.

I do not have the answer but I gain the access to the environment by killing the hostd process and restart the process itself using "service mgmt-vmware restart".

some other lines in my log

Host: 2587: irq 6 vector 0x41 (host 0x51)

...

Setting capabilities 0x0 for device vmnic3

...

"VMOverheadGrowthLimit" = -1, Old Value: -1, (Status: 0x0)

...

my 2 cents

Sam

samcer| http://about.me/samcer | http://www.vm-support.it/ | @samuelecerutti
0 Kudos