VMware Cloud Community
Greg171
Contributor

Unexpected host shutdown.

Hi there.

I'm new, so please be gentle. :)

We are going to set up 2 ESX servers with a number of virtual machines on them.

The hardware is IBM SystemX 3650 (7947-52G) with loads of disks (10 x 300 GB and 2 x 147 GB). We are running ESX 4.1 and connecting directly via vSphere Client (not running a vCenter Server as of yet). The disks are handled as 3 virtual units configured as follows: RAID 1 147 GB, RAID 5 1.9 TB and RAID 1 278 GB.

All is fine and dandy up to the point when the Win2k8 R2 64-bit VMs start powering down autonomously.

In the console (vSphere Client) the event logged is ' "machine name" is powered off' with a timestamp, and the action is 'ordered' by the user 'User' (not defined in the system/vSphere Client). It is also certain that the power-off is not initiated by a user from within Windows, as the server has no network access as of yet, nor is it a power saving option.

Shutdown takes place about 60 min after power up. Like clockwork!

What is even more perplexing is that only some of the VMs are doing this.

In the log/vmkernel I can see something like 'cpuXX VSCSI:XXYY: handle ZZHH Reset request on fss handle', and in the Windows Event Viewer I can see some 'kernel general' and 'kernel power manager' messages, but the list of events starts with multiple entries of services being stopped.

I've read around on forums, but the logic around hardware/driver issues just doesn't add up when 6 of my 10 VMs work fine.

Please help, as I'm getting to the point of total desperation.

Cheers!!

/Greg

12 Replies
DuncanArmstrong
VMware Employee

Welcome to the forums!

Out of curiosity, do you happen to have vCenter deployed at this time? The two servers would be managed by such a host.

I'd look to see if you had Guest Monitoring enabled - if so, disable it for now and see if the unexpected shutdowns cease. The default action of such a tool is to power off and reset virtual machines that are deemed unresponsive (namely, loss of heartbeats from VMware Tools).

Do you have any third-party applications, utilities, backup agents, or software installed on the ESX hosts' consoles?

You can also provide time frames of when the issues occurred and the /var/log/vmkernel, /var/log/messages, and /var/log/vmware/hostd.log files for review, but if it gets too involved, you should probably start thinking about filing a Support Request.

Do all the virtual machines shut off at the same time?

Greg171
Contributor

Hi.

The servers are not managed via vCenter at this moment, i.e. they are working as 2 standalone units, I reckon.

Ergo, no guest monitoring running.

Since I am a "better-safe-than-sorry" kind of fellow, I have only installed the OS on all the VMs and wanted to validate that that part is doing what it should. Apparently it isn't.

The only post-OS-installation modifications I've made are installing VMware Tools, enabling RDP and changing the power options so that THAT doesn't mess with any further services.

As for the timing of the phenomenon: machines will power down approx. 60 min from power up. Independently.

Logs attached as requested: the full zip file, as selecting individual logs is a pain.

Specifically, I guess this is strange (event list from vmkernel at shutdown of the VM):

Sep 13 10:13:53 bmaesx01 vmkernel: 2:21:27:44.044 cpu9:5296)VSCSI: 2245: handle 8204(vscsi0:0):Reset request on FSS handle 25198662 (0 outstanding commands)

Sep 13 10:13:53 bmaesx01 vmkernel: 2:21:27:44.044 cpu10:4215)VSCSI: 2519: handle 8204(vscsi0:0):Reset

Sep 13 10:13:53 bmaesx01 vmkernel: 2:21:27:44.045 cpu10:4215)VSCSI: 2319: handle 8204(vscsi0:0):Completing reset (0 outstanding commands)

Sep 13 10:13:54 bmaesx01 vmkernel: 2:21:27:45.267 cpu14:4140)NetPort: 1157: disabled port 0x200001b

Sep 13 10:13:54 bmaesx01 vmkernel: 2:21:27:45.267 cpu14:4140)NetPort: 1157: disabled port 0x200001c

Sep 13 10:13:54 bmaesx01 vmkernel: 2:21:27:45.267 cpu14:4140)Net: 1847: disconnected client from port 0x200001b

Sep 13 10:13:54 bmaesx01 vmkernel: 2:21:27:45.267 cpu14:4140)Net: 1847: disconnected client from port 0x200001c
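For anyone wanting to pull lines like these out of a larger vmkernel log, here is a minimal Python sketch. The regular expression is modeled only on the lines quoted above, so other ESX builds may well format things differently, and the function names `parse_vmkernel_line` and `vscsi_resets` are made up for illustration.

```python
import re

# Pattern modeled on the vmkernel lines quoted in this post (an assumption,
# not an official log format specification).
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+vmkernel:\s+"
    r"(?P<uptime>[\d:.]+)\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"(?P<module>\w+):\s+(?P<line>\d+):\s+(?P<message>.*)$"
)


def parse_vmkernel_line(line):
    """Split one vmkernel log line into named fields, or return None."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def vscsi_resets(lines):
    """Return parsed entries from the VSCSI module that mention a reset."""
    entries = []
    for line in lines:
        entry = parse_vmkernel_line(line)
        if entry and entry["module"] == "VSCSI" and "reset" in entry["message"].lower():
            entries.append(entry)
    return entries


# Two of the lines quoted above, used as sample input.
sample = [
    "Sep 13 10:13:53 bmaesx01 vmkernel: 2:21:27:44.044 cpu9:5296)VSCSI: 2245: "
    "handle 8204(vscsi0:0):Reset request on FSS handle 25198662 (0 outstanding commands)",
    "Sep 13 10:13:54 bmaesx01 vmkernel: 2:21:27:45.267 cpu14:4140)NetPort: 1157: "
    "disabled port 0x200001b",
]

for entry in vscsi_resets(sample):
    print(entry["stamp"], entry["message"])
```

Comparing the timestamps of such entries against each VM's power-on time would be one way to check the "60 minutes, like clockwork" pattern.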

Within windows you see:

'The process WLMS.exe has initiated the power off on behalf of user.... '

Fishy!!! Time to google some more! ;)

Cheers!

virtualdive
VMware Employee

Hey Greg,

Could be esx.conf is misconfigured somehow...you might wanna try the below command and see if it fixes the issue

esxcfg-boot -b

Regards,

'V'
thevshish.blogspot.in
vExpert-2014-2021
Greg171
Contributor

Thanks for the tip.

Tried it and validating right now.

I have also found that the *.exe I found in the win event viewer is a potential bad guy in this case.

It is a Windows license validation process, which could cause me problems as my copy isn't validated yet (no network access).

However, this should then be an issue with ALL VMs, so my theory doesn't match reality.

I'll keep you posted! :)

DuncanArmstrong
VMware Employee

Sorry about that pointless vCenter question - it was mentioned right in your original post, heh.

It seems the logs you collected are for your client, rather than the host - want to log in and collect it from the command-line using "vm-support" and collecting/attaching the .tgz file that's generated as a result?

Mind that hostnames and IPs are contained in the logs. Use a "sed" script to replace these instances in the log files if you absolutely need to.
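If "sed" isn't handy, the same kind of scrubbing can be sketched in Python. This is only a rough illustration: the function name `sanitize`, the placeholder strings, and the naive IPv4 pattern are all assumptions, and anything sensitive should still be checked by eye afterwards.

```python
import re

# Naive IPv4 matcher: four dot-separated groups of 1-3 digits.
# It does not validate octet ranges and ignores IPv6 entirely.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize(text, hostnames, ip_placeholder="x.x.x.x"):
    """Replace each listed hostname with host01, host02, ... and mask IPv4 addresses."""
    for index, name in enumerate(hostnames, start=1):
        text = re.sub(re.escape(name), f"host{index:02d}", text, flags=re.IGNORECASE)
    return IPV4_RE.sub(ip_placeholder, text)


# Made-up log line for illustration only.
line = "Sep 13 10:13:54 bmaesx01 vmkernel: connection from 192.168.1.20"
print(sanitize(line, ["bmaesx01"]))
```

Run it over a copy of the extracted log files, never over the originals, so the unmodified logs remain available if a Support Request is filed later.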

The systems sound pretty basic for now, which is good for troubleshooting. I'm interested in what's in your /var/log/vmkernel and /var/log/vmware/hostd.log files.

jlehtinen
Enthusiast

Hiya,

I believe "wlms.exe" is the Windows Licensing Monitoring Service. If you have not activated your copy of Windows Server, it will shut down every hour; this is expected behavior.

Remember - you need to activate TRIAL copies of Windows Server as well - you get 10 days to do this. Once you activate the trial, THEN you get the full 180 days of trial use. If you do not do this step and you go 10 days past the install date the server will power off every hour.
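To make the timeline above concrete, here is a small Python sketch of the rule as described in this post. The 10-day and 180-day figures come from the post itself, not from checked Microsoft documentation, and the assumption that the 180 days count from the activation date, as well as the function name `license_state`, are mine.

```python
from datetime import date, timedelta

GRACE_DAYS = 10    # per the post: days allowed before the trial must be activated
TRIAL_DAYS = 180   # per the post: trial length once activated (assumed to run from activation)


def license_state(install_date, today, activated_on=None):
    """Classify a server under the simplified rule described above."""
    if activated_on is not None:
        expires = activated_on + timedelta(days=TRIAL_DAYS)
        return "trial active" if today <= expires else "trial expired"
    deadline = install_date + timedelta(days=GRACE_DAYS)
    return "grace period" if today <= deadline else "hourly shutdown expected"


# Made-up dates for illustration only.
print(license_state(date(2020, 1, 1), date(2020, 1, 5)))
print(license_state(date(2020, 1, 1), date(2020, 1, 20)))
print(license_state(date(2020, 1, 1), date(2020, 1, 20), activated_on=date(2020, 1, 15)))
```

Under this simplified model, servers installed on different days would cross the 10-day deadline on different days.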

So, either activate a trial license, or input your license key and it should be OK.

Greg171
Contributor

Hi!

jlehtinen: you're right. But how come this isn't the case with all servers?

They are all installed from the same ISO, and most of them are running constantly.

I will however get some networking up and running so that I can validate our ideas.

Then I'll update you all. Thanks!

Greg171
Contributor

Hi guys!

Seems that it was a Windows license issue.

Now that the OS is activated, it is up and running way past 60 minutes.

Why only a few of the VMs acted this way is surely strange, but nothing to waste time on.

I greatly appreciate the help and moral support you guys have given me.

Cheers!

jlehtinen
Enthusiast

Awesome - I'm glad you got it working.

It is puzzling that only some of them were shutting down like that, especially if they were all off the same image.

Did you run Sysprep to reseal the image before you made the ISO, or did you run NewSID on each machine? If they all have the same identifier and they're sitting on the same network, you can sometimes get inconsistent behavior from cloned machines.

Greg171
Contributor

I'm all smiles, too! :D:D:D:D

I didn't run anything, as it is an ISO directly from MS.com and I already had the license key.

Also, the machines were not connected to the same network, as there was no working network configured in the VMs other than a few VLANs. Is it possible that the hosts talked via the fictive Windows IPs over the same vSwitch?

Sounds odd, doesn't it?

And from comparing 2 hosts, the ID (I guess you meant the computer name?) is not the same.

Thanks a lot!

jlehtinen
Enthusiast

Yeah, that is pretty odd. Without looking at them directly, it would just be a guessing game on why some worked and others didn't. Since they're working now it's not a huge concern I suppose. 🙂

By identifier I meant the SID. Since it's a newer server OS, it's highly unlikely it had anything to do with the SIDs.

As some (unnecessary) background: traditionally you needed to generate a new SID using Sysprep on every install of a Windows operating system, otherwise resources that used the SID as an identifier would get messed up by the duplicate entries on your network. In newer OSes almost nothing references the SID, so it really isn't a concern anymore. There's a good write-up on this here:
