VMware Cloud Community
maskham
Contributor

vSphere 5.5 VM console access errors / MKS connection terminated by server & MKS malformed response from server

I have just rolled out vCloud 5.5 and vSphere 5.5 and have struck an issue that recurs 13 days after each physical host reboot.

When trying to connect to a VM console using any of [vCD/Web Client/C# Client], I get the following errors:

1. Blank Screen

2. Occasionally MKS malformed response from server

3. Unable to connect to the MKS: Connection terminated by server

I have tried migrating the VM to another host, which made no difference, and restarting the management agents on the host, which also made no difference. Rebooting the host and migrating the VM back to it fixed the issue. I suspect a bug in vSphere 5.5, as all [3] hosts had identical problems. So far I have struck this issue twice; my servers have now been up for 7 days, and I expect the 3rd occurrence on 13/11/2013 ;(

Issue started 16/10 08:30am [first noticed]

Server uptime all [3] hosts 13 days

ESXi build 5.5.0 1331820


Issue re-occurred 29/10 15:00 [second time]

Server uptime all [3] hosts 13 days

ESXi build 5.5.0 1331820

 

Below is Syslog Server Capture when attempting a VM console connection.


<166>2013-10-15T23:08:39.722Z fcsesx01.fred.local Hostd: -->

<166>2013-10-15T23:08:39.722Z fcsesx01.fred.local Hostd: [3A160B70 verbose 'Hostsvc.ResourcePool pool18'] Added child 26 to pool

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxAuthdConnect: Returning false because CnxAuthdProtoReadResponse2 failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxConnectAuthd: Returning false because CnxAuthdConnect failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] Cnx_Connect: Returning false because CnxConnectAuthd failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] Cnx_Connect: Error message: Connection terminated by server

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Foundry_[Create|Open]Ex failed: Error: (3008) Cannot connect to the virtual machine

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Failed to load virtual machine.

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Failed to load VM N5Vmomi5Fault11SystemError9ExceptionE(vmodl.fault.SystemError)

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Marking VirtualMachine invalid

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] State Transition (VM_STATE_INITIALIZING -> VM_STATE_INVALID_CONFIG)

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 verbose 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Time to load virtual machine: 51 (msecs)

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc'] Loaded virtual machine: /vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: -->
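Each failed console attempt in the capture above ends in the same Cnx_Connect error line. A minimal sketch for counting those lines in a saved copy of the log follows. The function name count_mks_failures is hypothetical, and the log path in the usage note is an assumption rather than something stated in this post.

```sh
#!/bin/sh
# Count console-connection failures in hostd log text read from stdin.
# The match string is copied from the syslog capture above.
# count_mks_failures is a hypothetical helper name, not a VMware tool.
count_mks_failures() {
    grep -c 'Cnx_Connect: Error message: Connection terminated by server'
}
```

Usage, assuming the log is at /var/log/hostd.log on the host: count_mks_failures < /var/log/hostd.log. The file written by the syslog server works the same way. Note that grep -c prints 0 but returns a non-zero exit status when nothing matches.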


Any help would be appreciated.

I have also logged the call with HP, as we have software support with them. They mentioned that a number of sites have similar issues and that a call has been placed with VMware for further investigation.

Rgds

Mark Askham

Senior Infrastructure Architect

76 Replies
abaack
Contributor

This is different from what people were experiencing in this thread, which were HP server-specific problems. I would either create a new thread here or contact VMware support.

GroundX
Contributor

I have DL380 Gen8 servers and can confirm that this is happening under 1331820 but not in 1623387.

Thought I could stay on 1331820 on servers that had that image, but now I will upgrade those to 1623387 as well.

valpa
Contributor

I can confirm that this happens in 1623387 as well.

I cannot connect to the VM console and see the same error message.

I have only one host, no VC, and no DNS; I connect from the vSphere Client to that ESXi host purely by IP address.

McKimmi
Contributor

I am out of the office through 25 August and have only limited access to my e-mail. For urgent matters, please contact the IT Servicedesk by telephone at +31881162020 or by e-mail: it-servicedesk@azl.eu.

Kind regards,

Robert Cremers

IT Management & Maintenance

AZL NV

digitalnomad
Enthusiast

Well, it looks as though this issue is still alive and well. I worked with VMware Support to identify the root cause of the failures, and after checking vmkwarning.log it was confirmed to be yet another issue with the HP AMS service. In our case we run scratch off a different partition, and the problem took almost exactly 82 days to surface. I'm still awaiting feedback from HP on the "correct" way to remediate the issue. Frankly, this is my 2nd go-around in 3 months with similar issues that have endangered my environment, and I'm seriously considering just removing AMS altogether.

Here are the details of our affected build:

HP DL560 G8

ESXi 5.5.0 189274

FW  02-2014 B

HP Offline Management Bundle v 1.7-13

AMS V 550.9.6.0-12.1198610

Build Original HP Custom CD 5.5x

Reported Errors on Affected Hosts

"Console" Can't Fork

vMotion Timed Out

vCenter Host Configuration Security Profile Access: Call "HostImageConfigManager.QueryHostAcceptanceLevel" for object "imageConfigManager-252705" on vCenter Server "your vcenters fqdn" failed.

vmkwarning.log: "2014-10-14T20:20:53.668Z cpu39:46241)WARNING: Heap: 4128: Heap_Align(globalCartel-1, 136/136 bytes, 8 align) failed. caller: 0x41801ffe5429"

Avamar 7.01 Backup and Restore Failure: Restore Error 10011 Failed to write to disk, Backup failure, Snapshot failure
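The Heap_Align warning quoted above names the heap that could not satisfy an allocation (globalCartel-1 in this case). A rough sketch for listing which heaps are failing in a saved copy of the log follows. The helper name failing_heaps is hypothetical, and the sed pattern is derived only from the single sample line quoted above, so other message layouts may not match.

```sh
#!/bin/sh
# List the distinct heap names found in "Heap_Align(...) failed" warnings.
# Reads log text from stdin; failing_heaps is a hypothetical helper name.
failing_heaps() {
    # Capture the first argument inside Heap_Align( ... ), up to the comma.
    sed -n 's/.*Heap_Align(\([^,]*\),.*) failed.*/\1/p' | sort -u
}
```

Usage, assuming the log is at /var/log/vmkwarning.log on the host: failing_heaps < /var/log/vmkwarning.log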

Recommended Remediation

1) Lower the memory overhead on affected hosts by shutting off 25% or more of the VMs

2) Log in to Tech Support Mode

3) Run the following commands

•Stop the service via SSH or the ESXi Shell: /etc/init.d/hp-ams.sh stop

•Disable the service at boot via SSH or the ESXi Shell: chkconfig hp-ams.sh off

At this point host operations (backups, vMotion) should return to normal.

No immediate host reboot has been required thus far
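The two commands from step 3 can be wrapped so they are previewed before anything is changed. This is only a sketch: run and disable_hp_ams are hypothetical helper names, and the DRY_RUN switch is an illustration, not part of any HP or VMware tooling.

```sh
#!/bin/sh
# Dry-run wrapper around the two commands listed in step 3 above.
# DRY_RUN defaults to 1, so nothing is changed unless DRY_RUN=0 is set.
# run and disable_hp_ams are hypothetical helper names.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

disable_hp_ams() {
    # Stop the running service, then keep it from starting at boot.
    run /etc/init.d/hp-ams.sh stop &&
    run chkconfig hp-ams.sh off
}
```

Calling disable_hp_ams prints the two commands; DRY_RUN=0 disable_hp_ams executes them on the host.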

Long Term:

I'm trying to work with HP to identify a solid remediation process.

HP's support team has stipulated that you must remove the VIB and reboot before updating AMS. What may be happening is that not all affected components are removed, so simply applying the subsequent patch or full bundle may not refresh all components.

Here's a list of resources on the subject matter thus far.

VMware KB 2085618

HP mmr_kc-0120902  http://tinyurl.com/mxkt9es

HP FW http://tinyurl.com/k6hebz4

stephenrbarry
Enthusiast

It's a known issue with HP AMS versions 9.6 or 10.0, and is fixed with 10.0.1.  Isn't the long term solution to upgrade to the latest version?

Link to HP KB: http://h20000.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/mostViewedDisplay/?sp4ts.oid=521...

ChicaneUK
Enthusiast

Through an extremely weird coincidence, we ran into this exact same problem last night and thought it was about to take down an entire cluster. HP's damn software strikes again. As has been said, an update to the AMS package fixes this. Or stop and disable it if you don't need it.

lakey81
Enthusiast

It's interesting that you saw symptoms after 82 days. We also saw issues at around 82-84 days of uptime, but the problem seems to get progressively worse from when you first notice symptoms until day 84. If you wait long enough, your only option will be to shut down all VMs on the host and reboot, because vMotion will be broken, starting services won't work, and you can't log in to the console or SSH in, so you really can't do anything to resolve the problem.

Also this only seemed to affect G8s, none of my G7s were affected.

skemparaju
VMware Employee

Try this.

Issue: If a running VM is reloaded, or if hostd is restarted with VMs running, the UI is no longer able to acquire new MKS/device/guestControl tickets for those VMs, so the user cannot open the consoles or interact with the guest through the UI.

Fix: When connecting to a running VM on ESX (e.g. when reloading), hostd calls VigorClient_ConnectOnlineLocal to establish the connection. That function in turn calls VigorClientCOnnect with the "wantMks" parameter always set to FALSE, which causes the mks/ sub-tree to be skipped when mounting VMDB. Since hostd on ESX always wants to mount the mks/ sub-tree, this change simply removes the conditional that allows skipping mks/. This is considered safe on vsphere55 because hostd always wants to mount "mks/". It also matches the vsphere51 behavior, where Foundry always mounts "mks/" when connecting to a running VM.

dkhleon
Contributor

I encountered the same issue with VM console access errors / MKS connection terminated by server. I found that the error occurs whenever any VM has a snapshot. I deleted all my snapshots and disabled snapshots, and that resolved the VM console access errors. I hope that answers your question.

selva84
Contributor

Download the hp-ams offline bundle for your hardware model and ESXi version, then install the updated hp-ams VIB on your ESXi host using the command below:

esxcli software vib install -d /datastore/directory/hp-ams-esxi5.5-bundle-9.5.0-15.zip

A reboot is required for the changes to take effect. This updated hp-ams VIB stops the excessive logging to the hpHelper.log file. Once that is done, you will be able to access your virtual machines' consoles without any issues.
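Earlier in this thread, HP support is reported to require removing the old VIB and rebooting before installing the new one. A sketch of that order as a printed checklist follows; the reboots in between rule out a single unattended script. The helper print_ams_upgrade_plan is hypothetical, and the VIB name hp-ams passed to esxcli software vib remove is an assumption that should be confirmed against the vib list output first.

```sh
#!/bin/sh
# Print the remove / reboot / install order for a given offline bundle path.
# Nothing is executed; the commands are meant to be run by hand, one at a time.
# print_ams_upgrade_plan is a hypothetical helper name.
print_ams_upgrade_plan() {
    bundle=$1
    if [ -z "$bundle" ]; then
        echo "usage: print_ams_upgrade_plan /path/to/offline-bundle.zip" >&2
        return 1
    fi
    echo "esxcli software vib list | grep hp-ams"   # confirm installed name and version
    echo "esxcli software vib remove -n hp-ams"     # VIB name assumed to be hp-ams
    echo "reboot"
    echo "esxcli software vib install -d $bundle"
    echo "reboot"
}
```

For example, print_ams_upgrade_plan /datastore/directory/hp-ams-esxi5.5-bundle-9.5.0-15.zip prints the five steps for the bundle named above.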

For more information : http://www.vmwarearena.com/2014/04/unable-to-connect-to-mks-connection-terminated-by-server-on-hp-ha...

Author: Mohammed Raffic.

mikefsc
Contributor

I'm late to this party, but I am experiencing the same issues as others in this post (unable to connect to the MKS when opening the console, errors when starting SSH/ESXi Shell) 82 days after upgrading my HP DL380p G8s using the HP custom ISO for 5.5 Update 1 and 5.5 Update 2. I followed selva84's instructions and VMware KB article 2085618. My hp-ams VIB versions were 550.9.6.0-12.1198610 and 550.10.0.0-18.1198610. After opening a case with VMware, I upgraded to 550.10.0.1-07.1198610 using the esxcli commands. The update was successful. I will begin this on my other hosts and wait 82 days to see what happens.
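The version strings reported in this thread can be turned into a quick check. This is a sketch only: the mapping reflects what posters here reported (9.6 and 10.0 affected, 10.0.1 fixed), not an official HP or VMware compatibility list, and hp_ams_verdict is a hypothetical helper name. The installed version would typically be read from the output of esxcli software vib list.

```sh
#!/bin/sh
# Classify an hp-ams VIB version string using the versions reported in this thread.
# hp_ams_verdict is a hypothetical helper; anything not listed is "unknown".
hp_ams_verdict() {
    case "$1" in
        550.9.6.*|550.10.0.0-*) echo "affected" ;;
        550.10.0.1-*)           echo "fixed" ;;
        *)                      echo "unknown" ;;
    esac
}
```

For example, hp_ams_verdict 550.9.6.0-12.1198610 reports the first version listed above as affected.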

paolomic
Enthusiast

I wouldn't hold your breath waiting for HP to fix this. This has been ongoing for a year and a half, ever since we purchased 8 x DL360 G8p servers. Each subsequent release of the HP-AMS VIB package doesn't seem to fix anything. In fact, the Update 2 ISO that I applied to my nodes at Xmas has made things even worse: we lose the consoles almost every 2 weeks. What is far more troubling to me is that my backup software, Commvault, will not back up the VMs while this problem persists, as it cannot open the .vmx file to create the snapshot for the backup.

So not only do I have this mess of an error, I also have to check the backup logs every single day to see whether all jobs ran, reboot or restart agents, and re-run the backup jobs. It's very, very annoying.

HP support is useless; they tell me to go to VMware, even though it's their drivers causing the issues.

RGuinazzo
Contributor

I know this thread is old.  This is to cover anyone that may encounter this issue and find this thread.  Josh below had the problem correct and HP does have an answer.

This is a link to the HP problem description and the associated hot fix

HP Support document - HP Support Center

Title: HP ProLiant Gen8 Server - HP-AMS 9.6 Memory Leak Issue

Object Name: mmr_kc-0120902

Document Type: Support Information

Original owner: KCS - ProLiant Servers

Disclosure level: Public

Version state: final

Environment

FACT: HP ProLiant Gen8 Server Blade
FACT: VMware ESXi 5.x configured in HA & DRS cluster
FACT: HP-AMS 9.6
levirogers
Contributor

EDIT:  Apparently the link is bad............Sigh.

You, sir, are a fucking saint. Thanks for posting the fix. We just noticed this on some hosts and I was at a loss as to what the issue was. Thanks for following up with a link to the fix.

mikefsc
Contributor

Just an update.  I am no longer having any issues on the hosts, so the updated VIB worked for me.
