VMware Cloud Community
maskham
Contributor

vSphere 5.5 VM console access errors / MKS connection terminated by server & MKS malformed response from server

I have just rolled out vCloud 5.5 and vSphere 5.5 and have struck an issue that recurs 13 days after each physical host reboot.

When trying to connect to a VM console using any of [vCD/Web Client/C# Client] I get the following errors:

1. Blank Screen

2. Occasionally MKS malformed response from server

3. Unable to connect to the MKS: Connection terminated by server

Migrating the VM to another host made no difference, and neither did restarting the management agents on the host. Rebooting the host and migrating the VM back to it fixed the issue. I suspect a bug in vSphere 5.5, as all [3] hosts had identical problems. So far I have struck this issue twice; my servers have now been up for 7 days, and I expect the third occurrence on 13/11/2013 ;(

Issue started 16/10 08:30am [first noticed]

Server uptime all [3] hosts 13 days

ESXi build 5.5.0 1331820


Issue re-occurred 29/10 15:00 [second time]

Server uptime all [3] hosts 13 days

ESXi build 5.5.0 1331820

 

Below is the syslog server capture taken when attempting a VM console connection.


<166>2013-10-15T23:08:39.722Z fcsesx01.fred.local Hostd: -->

<166>2013-10-15T23:08:39.722Z fcsesx01.fred.local Hostd: [3A160B70 verbose 'Hostsvc.ResourcePool pool18'] Added child 26 to pool

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxAuthdConnect: Returning false because CnxAuthdProtoReadResponse2 failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxConnectAuthd: Returning false because CnxAuthdConnect failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] Cnx_Connect: Returning false because CnxConnectAuthd failed

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] Cnx_Connect: Error message: Connection terminated by server

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Foundry_[Create|Open]Ex failed: Error: (3008) Cannot connect to the virtual machine

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Failed to load virtual machine.

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Failed to load VM N5Vmomi5Fault11SystemError9ExceptionE(vmodl.fault.SystemError)

<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Marking VirtualMachine invalid

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] State Transition (VM_STATE_INITIALIZING -> VM_STATE_INVALID_CONFIG)

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 verbose 'Vmsvc.vm:/vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx'] Time to load virtual machine: 51 (msecs)

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: [FFE40B70 info 'Vmsvc'] Loaded virtual machine: /vmfs/volumes/5248c6f0-6a442ffe-c35f-0017a47724d8/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980)/diskload (a33d96fa-47c9-4135-b05e-fd39ce56d980).vmx

<166>2013-10-15T23:08:39.724Z fcsesx01.fred.local Hostd: -->
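For anyone sifting through a larger capture, here is a minimal Python sketch (not a VMware tool) that pulls the chain of Cnx failures out of hostd syslog lines like the ones above. The regular expressions and function names are assumptions based only on the lines in this post. The `<166>` prefix is decoded as a standard syslog priority value (facility * 8 + severity).

```python
import re

# Layout inferred from the capture above (an assumption, not a documented format):
# <PRI>timestamp host Hostd: [thread level 'component'] message
HOSTD_LINE = re.compile(
    r"^<(?P<pri>\d+)>(?P<ts>\S+) (?P<host>\S+) Hostd: "
    r"\[(?P<thread>[0-9A-Fa-f]+) (?P<level>\w+) '(?P<component>[^']*)'\] (?P<msg>.*)$"
)
CNX_FAIL = re.compile(r"^(\w+): Returning false because (\w+) failed$")


def parse_hostd_line(line):
    """Return a dict of fields, or None for lines that do not fit the layout (e.g. 'Hostd: -->')."""
    m = HOSTD_LINE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Standard syslog PRI: facility * 8 + severity
    rec["facility"], rec["severity"] = divmod(int(rec.pop("pri")), 8)
    return rec


def cnx_failure_chain(lines):
    """List (caller, failed_callee) pairs in the order they were logged."""
    chain = []
    for line in lines:
        rec = parse_hostd_line(line)
        if rec is None:
            continue
        m = CNX_FAIL.match(rec["msg"])
        if m:
            chain.append((m.group(1), m.group(2)))
    return chain


# Lines copied from the capture in this post
SAMPLE = [
    "<166>2013-10-15T23:08:39.722Z fcsesx01.fred.local Hostd: -->",
    "<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxAuthdConnect: Returning false because CnxAuthdProtoReadResponse2 failed",
    "<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] CnxConnectAuthd: Returning false because CnxAuthdConnect failed",
    "<166>2013-10-15T23:08:39.723Z fcsesx01.fred.local Hostd: [3A8D2B70 info 'Libs'] Cnx_Connect: Returning false because CnxConnectAuthd failed",
]

if __name__ == "__main__":
    for caller, callee in cnx_failure_chain(SAMPLE):
        print(f"{caller} failed because {callee} failed")
```

Read bottom-up, the chain suggests the console request dies at the authd handshake (`CnxAuthdProtoReadResponse2`), before hostd ever attaches to the VM.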


Any help would be appreciated.

I have also logged the call with HP, as we have software support with them. They mentioned that a number of sites have similar issues and that a call has been placed with VMware for further investigation.

Rgds

Mark Askham

Senior Infrastructure Architect

76 Replies
MR-Z
VMware Employee

It should be called hpHelper, living under /opt/hp/hp-ams.

Gaurav_Baghla
VMware Employee

/etc/init.d/hp-ams.sh stop

To persist the change in the event of a host reboot, run the command:

chkconfig hp-ams.sh off

Regards Gaurav Baghla Opinions are my own and not the views of my employer. https://twitter.com/garry_14

mstols
Contributor

Also performed the commands that Gaurav_Baghla mentioned.

It only seems to affect our BL460c Gen8 hosts, BL490 G7 hosts have an uptime of 15 days, but don't seem affected.

We're also running vCloud and this issue affects vShield too, resulting in labs without network connectivity.

We're on the latest VMware patch, but that doesn't solve this issue.

HP, fix this!

stephenrbarry
Enthusiast

I seem to have had a more severe crash of one of my hosts today. I'm unable to SSH to the server (I get a "connection refused" error), can't log into the direct console (it just hangs without bringing up a prompt), and can't vMotion any VMs off the host (I get the error "A general system error occurred.").

The VMkernel log is filled with errors like this:

2014-01-28T16:04:49.874Z cpu16:3061156)WARNING: VisorFSObj: 1940: Cannot create file /var/run/sfcb/52c0b47e-eb5b-d2ce-1bde-8f8a13ed2f13 for process sfcb-CIMXML-Pro because the inode table of its ramdisk (root) is full.
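To see which processes are running into the full inode table, a minimal Python sketch like the following can tally the warnings from a saved copy of the log. It assumes every such warning follows the layout of the line above; the regex and function name are illustrative only and not part of any VMware tooling.

```python
import re
from collections import Counter

# Pattern inferred from the single warning quoted above (an assumption)
INODE_FULL = re.compile(
    r"Cannot create file (?P<path>\S+) for process (?P<process>\S+) "
    r"because the inode table of its ramdisk \((?P<ramdisk>[^)]+)\) is full"
)


def tally_inode_full(lines):
    """Count 'inode table ... is full' warnings per (process, ramdisk) pair."""
    counts = Counter()
    for line in lines:
        m = INODE_FULL.search(line)
        if m:
            counts[(m.group("process"), m.group("ramdisk"))] += 1
    return counts


# The warning quoted in this post
SAMPLE_LINE = (
    "2014-01-28T16:04:49.874Z cpu16:3061156)WARNING: VisorFSObj: 1940: "
    "Cannot create file /var/run/sfcb/52c0b47e-eb5b-d2ce-1bde-8f8a13ed2f13 "
    "for process sfcb-CIMXML-Pro because the inode table of its ramdisk (root) is full."
)

if __name__ == "__main__":
    for (process, ramdisk), n in tally_inode_full([SAMPLE_LINE]).most_common():
        print(f"{process} on ramdisk '{ramdisk}': {n} warning(s)")
```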

Does anyone have any ideas on how to stop the HP service without bringing down the host? Of course, there are several high-profile virtual machines on it that can't incur any downtime...

I have a ticket open with VMware support, but their response time is proving horrible: over 24 hours so far.

Steve

stephenrbarry
Enthusiast

Anyone know if you can use PowerCLI to stop the HP AMS service somehow?  For some reason, I can connect to the host using PowerCLI.

yuspino
Contributor

Hello. I had the same problem (black/blank console in the vSphere Client). In my case, the cause was a DNS problem.

Try this: log in with the vSphere Client using the IP address instead of the hostname.

Good luck!

stephenrbarry
Enthusiast

This is definitely not a DNS issue.

Minc
Contributor

I agree, this is not a DNS issue; stephenrbarry is right.

I have HP BL460 G8 blade servers, as well as HP DL360 G7, Dell R710, and Dell R410 servers.

All of the servers run ESXi 5.5, with the vCenter 5.5 virtual appliance.

But only the BL460 G8 model has the problem.

abaack
Contributor

Is anyone hearing any updates from HP or VMware regarding this?

And yes, while some users may be experiencing similar issues and can resolve them in different ways, this specific issue appears to only impact HP Gen8 blades running HP's custom ISO.

jhirsh
Contributor

HP has finally confirmed to me that they were able to reproduce the error in their lab, but they haven't been able to isolate "why" yet.

Here's a recap of what I know:

  • Applies to Gen8 servers only
  • After approximately two weeks of uptime, errors will start to be recorded in /var/log/hpHelper.log, starting with "AgentX master agent failed to respond to ping.  Attempting to re-register.". These errors will typically be recorded every 15 to 30 minutes up until a certain point.
  • Around this time the MKS errors are presented and you can no longer open console on a guest
  • After a further period of time, SSH connections are accepted, but terminate before you can enter the password (this is probably when hpHelper.log stops adding new entries)
  • If the ESXi Shell is not enabled, attempting to enable it will silently fail (the process will never spawn).
  • Various cron jobs on the ESXi host will also fail to spawn due to lack of resources. I haven't been able to ascertain what this affects, though.
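As a rough early-warning check for the second symptom in the recap above, here is a minimal Python sketch that scans a copy of hpHelper.log for the "AgentX master agent failed to respond to ping" message. It only does a substring match, because the exact line format of hpHelper.log is not shown in this thread; the function names are made up for illustration, and the default path is the one mentioned above.

```python
# Message text quoted in the recap above
MARKER = "AgentX master agent failed to respond to ping"


def find_agentx_errors(log_text):
    """Return (line_number, line) pairs for every line that contains the marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if MARKER in line
    ]


def scan_file(path="/var/log/hpHelper.log"):
    """Scan a log file on disk; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return find_agentx_errors(handle.read())


if __name__ == "__main__":
    hits = scan_file()
    print(f"{len(hits)} matching line(s)")
    if hits:
        print("first match at line", hits[0][0])
```

If the count starts climbing on a host approaching two weeks of uptime, that host is probably close to showing the MKS errors.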

On one particular server, I had stopped and restarted the hp-ams.sh service after it had bugged, but the issue never came back until a reboot (+ the two week delay). This could just be coincidental, as I only did it on one host.

Corrective actions:

Cheers

-Joshua

AndyRawcs
Contributor

I removed the hp-ams service before my last reboot and today is day 14 and the issue has not come back.

HP told me to run the following commands from the console:

esxcli software vib list

(You will see the hp-ams service listed)

esxcli software vib remove -n hp-ams

(Will remove the service and the system will then require a reboot)

HP support are due to come back to me tomorrow to check whether that has worked. They will insist that this is a VMware issue and that I need to log the problem with VMware. I guess that is the next step, but I fear I am going to get stuck between HP and VMware support, who will end up blaming each other.

I will keep you all updated if I get any more information from either.

joshswain
Contributor

HP just released an AMS update 9.5.0 (18 Feb 2014) along with other driver updates. I'll give it a go this week.

The following fixes have been made to HP Agentless Management Service:

  • Fixed excessive logging of elxnet warnings in vmkernel log when an Emulex NIC/CNA and a QLogic CNA configured as NIC-iSCSI are installed on the system
  • Fixed Memory Leak
  • Fixed issue of not reporting ATA Disk Drive Status Change trap

stephenrbarry
Enthusiast

I just looked at my downloads and saw that HP released an updated customized ISO today. Looks like this could finally be the solution we've been waiting for...

JeffDe
Contributor

Don't ever let HP give you the runaround.  If you have purchased your support from HP they are obligated to provide you with a solution. They have their own internal contacts with VMware and can work with them to find a solution for you.  My reseller taught me that I am sometimes too polite.  Let them know you are dissatisfied with their level of customer service and that you would like to speak with their supervisor.  There is no need to be caught in the middle. 

BluScreened2011
Contributor

My environment started acting similar after upgrading to vSphere 5.5.  But I don't get the MKS error that often, I just get a black screen on the console tab. We have all DELL hardware, no HP whatsoever. I can be working fine on a VM in the console tab, then click on another VM, click back and it's blank and never comes back. The only way to get it back is to right-click and "Open Console" then the console in the tab will reappear again.  Very annoying.

BigBjorn
Enthusiast

15 days ago, I updated my Gen8 hosts with AMS v9.5.0-15. The problem has not come back.

jhirsh
Contributor

Hey BigBjorn/joshswain/whomever, are you still having no issues with 9.5.0? My test instance has just reached 14 days, so I'm still a bit early to confirm if it's resolved.

Thanks

-Joshua

BigBjorn
Enthusiast

Everything still works for me.

joshswain
Contributor

No problems here on hosts at the 14 day mark since the last firmware flash and driver updates to match the 2014.02.0 SPP.

CelebrationBran
Contributor

Hello Guys and Ladies,

I have VMware vSphere running on an IBM ThinkServer RD640. Every day I have to restart the host server because I am not able to connect to my guest web server or any other server on the host.
