VMware Cloud Community
maskham
Contributor

vSphere 5.5 VM console access errors / MKS connection terminated by server & MKS malformed response from server

I have just rolled out vCloud 5.5 and vSphere 5.5 and have struck an issue that recurs 13 days after each physical host reboot.

When trying to connect to a VM console using any of [vCD/Web Client/C# Client] I get the following errors:

1. Blank Screen

2. Occasionally MKS malformed response from server

3. Unable to connect to the MKS: Connection terminated by server

I have tried the following: migrating the VM to another host made no difference; restarting the management agents on the host made no difference; rebooting the host and migrating the VM back to it fixed the issue. I suspect a bug in vSphere 5.5, as all [3] hosts had identical problems. So far I have struck this issue twice. My servers have now been up for 7 days, and I expect the third occurrence on 13/11/2013 ;(

Issue started 16/10 08:30am [first noticed]

Server uptime all [3] hosts 13 days

ESXi build 5.5.0 1331820


Issue re-occurred 29/10 15:00 [second time]

Server uptime all [3] hosts 13 days

ESXi build 5.5.0 1331820

 

Below is the syslog server capture from an attempted VM console connection.


<166>2013-10-15T23:08:39.722Z fcsesx01.fred.local Hostd: -->

<166>2013-10-15T23:08:39.722Z fcsesx01.fred.local Hostd: [3A160B70 verbose 'Hostsvc.ResourcePool pool18'] Added child 26 to pool

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxAuthdConnect: Returning false because CnxAuthdProtoReadResponse2 failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxConnectAuthd: Returning false because CnxAuthdConnect failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] Cnx_Connect: Returning false because CnxConnectAuthd failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] Cnx_Connect: Error message: Connection terminated by server

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Foundry_[Create|Open]Ex failed: Error: (3008) Cannot connect to the virtual machine

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Failed to load virtual machine.

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Failed to load VM N5Vmomi5Fault11SystemError9ExceptionE(vmodl.fault.SystemError)

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Marking VirtualMachine invalid

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] State Transition (VM_STATE_INITIALIZING -> VM_STATE_INVALID_CONFIG)

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 verbose 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Time to load virtual machine: 51 (msecs)

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc'] Loaded virtual machine: /vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: -->
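
For anyone comparing their own hosts against this capture, the Cnx_Connect error line can be counted in a saved log. This is only a minimal sketch: the function name is made up for illustration, and the file you pass in (a saved syslog capture or a copy of the host's hostd log) is an assumption about your setup, not something from the capture above.

```shell
# Count occurrences of the Cnx_Connect failure shown in the capture above.
# $1: path to a saved syslog capture or hostd log (hypothetical; supply your own).
count_cnx_terminated() {
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c 'Cnx_Connect: Error message: Connection terminated by server' "$1"
}
```

Called as `count_cnx_terminated /path/to/capture.log`, it prints a single number; a non-zero count only shows that the same error string was logged, not why.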


Any help would be appreciated.

I have also logged the call with HP, as we have software support with them. They mentioned that a number of sites have similar issues and that a call has been placed with VMware for further investigation.

Rgds

Mark Askham

Senior Infrastructure Architect

76 Replies
VZh
Contributor

I have the same problem. The VM console works for 11 days, and on the twelfth day it stops working. When connecting, I get the error: Unable to connect to the MKS: Malformed response from server.

Only restarting the host helps.

ESXi: HP Custom ISO 5.5.0 1331820

VC: 5.5.0 1312298

Does anyone have any ideas how to solve this problem?

MatW
Contributor

I also ran into the "Unable to connect to the MKS: Connection terminated by server" error when trying to open the console via vCenter. I'm building a new 5.5 test environment; everything had been running mostly idle for ~2 weeks, and when I went to access the consoles of my 4 VMs, none worked.


As a workaround, I did notice that I could use the vSphere client, connect to the host directly and open the console of the VM. 

jhirsh
Contributor

I've also seen the same issue. My hosts are using the HP custom 5.5 image too. I have an existing ticket open with VMware on the issue, but they don't have a resolution at the moment and at the time hadn't run into it before.

In my case, I typically see the "malformed response" error more than the "terminated by server" one, but both have occurred. Connecting directly to the host doesn't allow the console to open either. Workarounds to date have been to vMotion to a working host, or reboot the host. I haven't seen it occur on all of my hosts; so far it's only occurred on one of my clusters.

However, I have another issue that may (or may not) be related to the HP image, which occurs after a random amount of time where the host effectively is out of resources to spawn new processes, which affects certain tasks. An easy way to identify that situation is that SSH drops the connection prior to being able to enter a password and you'll see errors in the syslog complaining that cron can't fork due to lack of space (but the server isn't out of space).

Thanks

-Joshua

jhirsh
Contributor

Hello All,

I had some traction on one of my VMware support cases. They found a memory issue with the hpHelper process/world, which was consuming almost all of the available memory in the init resource pool (with a defined maximum of 220MB). On a server that was acting up (ssh connections being dropped before you could enter credentials, plus the MKS console errors), stopping the hp-ams service instantly allowed the server to resume normal operations.

     /etc/init.d/hp-ams.sh stop

HP hasn't progressed on their side of the ticket yet, but for the time being, the workaround appears to be to stop the hp-ams service.
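
As a rough sketch of that workaround, the stop command can be wrapped so it fails cleanly on a host where the init script isn't present. The function name `stop_hp_ams` is made up for illustration, and the default path is simply the one quoted above. It only stops the running service and makes no attempt to keep it stopped after a reboot.

```shell
# Stop hp-ams via its init script, if the script exists.
# $1: optional path to the init script; defaults to the path quoted above.
stop_hp_ams() {
    script="${1:-/etc/init.d/hp-ams.sh}"
    if [ ! -x "$script" ]; then
        echo "hp-ams init script not found or not executable: $script" >&2
        return 1
    fi
    # Same invocation as the manual workaround: <init script> stop
    "$script" stop
}
```

Running `stop_hp_ams` with no argument is equivalent to the single command above, with the extra existence check in front of it.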

Cheers

-Joshua

themac201110141
Contributor

jhirsh,

Thanks for the information. We are experiencing the same problem and hp-ams.sh stop seems to help.

Please let us know when HP has some info!

Thanks

jhirsh
Contributor

They haven't been able to reproduce it yet. The engineer that has my ticket was going to build out a lab with the same hardware as me to try and recreate the issue. What type of servers are you seeing the problem on?

I see it on my BL660 Gen8 servers.

themac201110141
Contributor

We're running ESXi 5.5 on BL460c Gen8 servers.

//Markus

AndyRawcs
Contributor

I have been running an ESXi 5.5 environment for 13 days now on BL460C Gen8 servers as well and today have started to get the MKS issues with all my VMs on all 4 hosts.  I can RDP to them all and still seem to have control of snapshots etc through the console as normal.

Feel slightly better that this is a known issue reading all these posts, but not that there is no known fix.

It is also causing havoc with my Veeam backups as some of the datastores are showing as locked, with the following error which I presume is related:

01/12/2013 05:59:35 :: Processing 'PHGBI1' Error: Client error: File does not exist or locked. VMFS path: [[Datastore05] PHGBI1/PHGBI1.vmx].

Please, try to download specified file using connection to the ESX server where the VM registered.

Failed to create NFC download stream. NFC path: [nfc://conn:phgvcentre.progressgroup.org.uk,nfchost:host-14,stg:datastore-31@PHGBI1/PHGBI1.vmx].

I am in the middle of migrating all my VMs over to the new environment but think I will get HP support involved before I go any further.

AndyRawcs
Contributor

Hi Josh

I will link your HP fault with mine tomorrow. The response HP sent me was:

"We would like to confirm that this is a known issue and VMware engineering is still working on the issue and expected release of ESXi 5.5 Update 1 is on 22nd Dec 2013, tentative date only. Workaround for now would be to restart services on ESXi host."

The reboot of the hosts has cleared the problems: Veeam is now backing up again and I have full control of the VMs back in vCenter.

I am presuming that I will have to reboot again in 14 days, since the uptime figure is 13 days and the count effectively starts at 0. Is that what everyone has found?

It doesn't really look like we have any alternative but to wait and apply the patch.

jhirsh
Contributor

Sounds like you've had more of a response than I've received. They weren't aware of the issue and neither was VMware (they basically closed my ticket(s) and pointed the finger at HP). Interesting that they're saying it's a fault within ESXi itself, as so far it's been looking like a 100% HP issue.

To avoid the reboots, stop the hp-ams service after your host boots up and before the problem occurs. With that service stopped, the issue will not present itself. If your hosts don't have the ESXi shell enabled, and this bug occurs, then your only course of action is a reboot. If your ESXi shell is enabled (prior to the issue, as it can't spawn after the issue starts), you can still get in via console and stop the hp-ams service and recover your host without requiring a reboot.

On my servers it seems to occur randomly after 14 days. Some servers hit it almost immediately after two weeks, others were a bit later. You can see when the problem starts, as you'll notice issues start to get logged to /var/log/hpHelper.log similar to the following every 10 minutes or so:

     AgentX master agent failed to respond to ping.  Attempting to re-register.

     ilo_close fd 8.  data=0x49d6bd08

     Failed: go_bye() read() returned err=1

As well as errors from cron not being able to spawn new processes, etc.
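
To catch the onset before consoles start failing, the log can be checked for the first of the messages quoted above. This is a minimal sketch only: the function name is illustrative, and the default path is the /var/log/hpHelper.log location mentioned above.

```shell
# Print how many times the hpHelper re-register message appears in a log.
# $1: optional log path; defaults to the hpHelper log named above.
hphelper_ping_failures() {
    log="${1:-/var/log/hpHelper.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # -F matches the text literally; -c prints the count of matching lines.
    # "|| true" keeps the exit status at 0 when the count is zero.
    grep -cF 'AgentX master agent failed to respond to ping' "$log" || true
}
```

A count that keeps growing every 10 minutes or so would match the pattern described above; a count of 0 just means the message hasn't been logged in that file.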

Cheers

-Joshua

abaack
Contributor

Started seeing the same thing on my servers today.  Thanks for all the help.

Will wait to find out when there's a fix.

MatW
Contributor

I'm running a DELL Blade Infrastructure on my end while encountering this issue.  It's very problematic and is keeping me from progressing on my deployment currently.  😞

abaack
Contributor

That may be a different issue though.  Are you running off of a custom Dell ESX ISO?

Is there a similar process running (like the HP has) that can be stopped that fixes it right away?

joshswain
Contributor

We're seeing it here also on HP BL460 G8 servers. If it's occurring on Dell as well, I'd think it's a VMware problem with processes running out of resources.

abaack
Contributor

I'll keep an eye on my BL490 G6 servers, on which I haven't noticed any problems yet.  Does everyone have Gen8 ones?

AndyRawcs
Contributor

I have the Gen8

joeshmo
Contributor

Same issue on all of our BL460c Gen8 here.

MatW
Contributor

Another little update...

I brought back up most of my environment to tinker with this a bit more.  I have a (16) DELL M610 blade cluster and a 3-host cluster of DELL R610s.  I moved a VM between the clusters, rebooted the ESXi host the VM was on as well as the vCenter and I still cannot open a console via vSphere or Web clients.

McKimmi
Contributor

Upgrade the HP BL460c to the latest firmware version 'HP_Service_Pack_for_Proliant_2013.09.0-0_744345-001_spp_2013.09.0-SPP2013090.2013_0830.30.iso'

It resolved the problem for us.
