VMware Cloud Community
Roxx16200
Contributor

Host (not responding)

I have 3 nodes in my cluster. 1 of them suddenly shows as "not responding."

When I restart the vmware-vpxa service, the host comes back online briefly and then goes right back offline. The vpxa.log is very busy, holding only about 5 minutes of data before it is overwritten (not sure if this is normal), but I don't see any glaringly obvious errors in it.

My /etc/hosts, /etc/sysconfig/network, and /etc/vmware/esx.conf files look correct. However, when I run ./ft_gethostbyname from the /opt/vmware/aam/bin directory, it returns "ft_gethostbyname(INIT) FAILED!"
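A rough cross-check of name resolution can be sketched in Python. This is only an illustration: it assumes a machine with Python 3 (for example an admin workstation or the VC server), it asks that machine's resolver rather than the ESX host's, and it does not reproduce whatever ft_gethostbyname does internally. The function name `check_names` is made up for this sketch.

```python
import socket

def check_names(names, resolver=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror and socket.herror are OSError subclasses
            results[name] = None
    return results
```

Passing both the short and the fully qualified name of every node shows at a glance which lookups come back as None. The `resolver` argument exists only so the function can be exercised without touching the network.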

I'm sure this is part of the problem, but I don't know how to fix it. I'm stuck at the moment because 6 of my production servers are still running on this ESX node, despite the errors.

Any help greatly appreciated.

33 Replies
admin
Immortal

The vpxa agent suffered some kind of error and backtraced. This more than likely would have caused it to disconnect from VC.

Does the time of this backtrace match the disconnection? Check for the string "Backtrace" in all the vpxa logs on all servers, and see whether each logged time matches the time that server disconnected.

From the vpxa log:

Increment master gen. no (67): VpxaAlarm::CheckAlarmStatus

Unhandled exception: Not initialized

Backtrace:

eip 0x909dd92

eip 0x9043444

eip 0x907f8b5

eip 0x8bab763

eip 0x897eaa2

eip 0x8c886d4

eip 0x8e26a99

eip 0xb9fdd8

eip 0x276fca

Some errors in the VC logs around the same time:

UpdateRuntimeData: Invalid managed entity status 93353016

UpdateRuntimeData: Invalid managed entity status 93346240

.......

-- FINISH task-internal-2 -- -- ScheduledTaskManager

VpxLro::LroMain took 773676173 ms

Also a DB error (not sure if it's related):

2008-08-28 13:57:42.725 'App' 5796 error VpxdMoHost::CollectRemote database error while flushing stats data: "ODBC error: () - " is returned when executing SQL statement "{ call load_stats_proc(?, ?, ?, ?, ?, ?) }"

That ScheduledTaskManager task took 773676173 ms! That can't be right.
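For scale, the reported figure converts to almost nine days. A quick check in Python (only the number 773676173 comes from the log; the rest is illustration):

```python
from datetime import timedelta

# Duration reported by "VpxLro::LroMain took 773676173 ms"
elapsed = timedelta(milliseconds=773676173)

print(elapsed)  # 8 days, 22:54:36.173000
```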

Both the backtrace on the ESX host and the errors on the VC seem to indicate a problem checking alarm status.

First check the rest of the vpxa logs to see if the backtrace corresponds to the disconnect times.
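The search can be scripted against copies of the logs. The sketch below is an assumption-laden example, not a VMware tool: it assumes the vpxa log lines carry timestamps in the same layout as the VC log line quoted above ("2008-08-28 13:57:42.725 ..."), and it reports the last timestamp seen before each "Backtrace" line. The function name `backtrace_times` is invented for illustration.

```python
import re
import sys

# Assumed timestamp layout, modelled on the VC log line quoted in this post;
# adjust the pattern if the vpxa logs use a different format.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def backtrace_times(lines):
    """Return the last timestamp seen at or before each line containing 'Backtrace'."""
    hits = []
    last_seen = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            last_seen = match.group(0)
        if "Backtrace" in line:
            hits.append(last_seen)  # None if no timestamp preceded the backtrace
    return hits

if __name__ == "__main__":
    # Pass the vpxa log files to scan as command-line arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for stamp in backtrace_times(handle):
                print(path, stamp)
```

The printed times can then be compared by hand with the times the host dropped out of VC.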

Also, attach/paste the hostd log from the same time. The hostd logs are stored in /var/log/vmware/ and their file names start with hostd.

You may want to consider opening a Support Request with VMware support.

Message was edited by: appk

Roxx16200
Contributor

vmkping doesn't miss a beat on either server. ?:|

I checked the vpxa files but they do not go back any further than the one I posted. They've been overwritten.

Hostd logs seem to have a problem.

At 13:45:10 my junior tech connected to VirtualCenter via the web interface and hit the RESET button on the MAIL02 virtual server, which was running on ESX2. After that the log loops (full log attached as esx2.hostd.log). The ESX host eventually appeared in my VCenter as (not responding).

Config target info loaded

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Task Created : haTask-80-vim.VirtualMachine.reset-23032

Event 92 : Mail02 on esx2.netsource-one.local in ha-datacenter is reset

State Transition (VM_STATE_ON -> VM_STATE_RESETTING)

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : open successful (17) size = 18269395968, hd = 0. Type 3

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : closed.

Disconnect check in progress: /vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02.vmx

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : open successful (17) size = 18269395968, hd = 0. Type 3

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : closed.

Event 93 : Mail02 on esx2.netsource-one.local in ha-datacenter is powered on

State Transition (VM_STATE_RESETTING -> VM_STATE_ON)

Task Completed : haTask-80-vim.VirtualMachine.reset-23032

Received a duplicate transition from foundry: 1

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23036

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23036

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23040

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23040

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23042

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23042

Ticket issued for mks connections to user: vpxuser

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23044

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23044

etc....

Then, in an attempt to recreate the problem, I did the same thing to the MAIL03 virtual server, which was hosted on ESX1. The same problem occurred (full log attached as esx1.hostd.log):

Task Created : haTask-64-vim.VirtualMachine.reset-59360

Event 192 : Mail03 on esx1.netsource-one.local in ha-datacenter is reset

State Transition (VM_STATE_ON -> VM_STATE_RESETTING)

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : open successful (17) size = 13860645888, hd = 0. Type 3

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : closed.

Disconnect check in progress: /vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03.vmx

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : open successful (17) size = 13860645888, hd = 0. Type 3

DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : closed.

Event 193 : Mail03 on esx1.netsource-one.local in ha-datacenter is powered on

State Transition (VM_STATE_RESETTING -> VM_STATE_ON)

Task Completed : haTask-64-vim.VirtualMachine.reset-59360

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59363

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59363

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59364

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59364

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59365

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59365

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59366

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59366

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59367

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59367

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59368

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59368

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59369

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59369

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59370

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59370

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59371

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59371

Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59372

State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59372

.....looping from there.

To bring them both back, I had to restart the vpxa service on each host, which brought it online in VC long enough to initiate a maintenance mode command, which VMotioned the live VMs off to another node. Then I rebooted the host and it came back online.

JonRoderick
Hot Shot

Ok, time for an update from me...

After getting our support team to analyse the VC logs etc., I noticed that the person who built the host had forgotten to reboot it after changing the SC memory to 800MB (so it was still running at 200MB).

I corrected this and haven't had an instance of the 'not responding' issue in the last 5 hours.

Aside from this, the support guy did notice the logs were overflowing and wrapping very quickly because of 1 particular VM with lots of entries like the following:

State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)

Similar to yours perhaps?

Jon

Roxx16200
Contributor

Thanks for the input, Jon. I have definitely rebooted since changing the SC memory to 800MB. In fact, prior to my reload of ESX, I had not even altered the SC memory and was still having similar problems. I suspect I'm having a problem with my Virtual Center, but I'm not sure at this point. It might also be reloaded soon if we can't find anything else.

admin
Immortal

The hostd log is full of RECONFIGURING events, which is not normal.

(VM_STATE_ON -> VM_STATE_RECONFIGURING)

(VM_STATE_RECONFIGURING -> VM_STATE_ON)
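To put a number on how dominant these events are, the transitions in a copy of the hostd log can be tallied. This is a minimal sketch that relies only on the "State Transition (A -> B)" line format visible in the excerpts posted earlier in this thread; the function name `count_transitions` is hypothetical.

```python
import re
from collections import Counter

# Matches hostd lines such as
# "State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)".
TRANSITION = re.compile(r"State Transition \((\w+) -> (\w+)\)")

def count_transitions(lines):
    """Count each (from_state, to_state) pair found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Feeding it an open log file and calling `most_common()` on the result lists the transition pairs from most to least frequent.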

Does the problem go away when you disable DRS and remove all resource pools?

This appears to be a known issue. You should open a support request with VMware support.

Message was edited by: appk

Roxx16200
Contributor

The problem DOES go away with the DRS removed.

My VMware environment is NFR (non-production, but used for training/testing). I don't believe I can open a case with VMware.

admin
Immortal

What appears to be happening is that all these events (which shouldn't be happening) are taking up all the resources for the vpxa agent.

The vpxa agent therefore cannot heartbeat the VC server so it shows as "not responding" in VC.
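As a toy model of that reasoning (not VMware's actual implementation, and with a made-up timeout value), a manager that only looks at the age of the last heartbeat would flag a starved agent like this:

```python
def host_status(last_heartbeat, now, timeout=60.0):
    """Classify a host from the age of its last heartbeat.

    All values are in seconds; the 60-second default is a hypothetical
    figure chosen for illustration, not a documented VC setting.
    """
    if now - last_heartbeat <= timeout:
        return "connected"
    return "not responding"
```

In this model the host itself can be perfectly healthy, with VMs running, and still be reported as "not responding" simply because the agent was too busy to send a heartbeat in time.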

I believe VMware are working on a patch so keep an eye out for new patches/releases. In the meantime you may want to keep DRS disabled.

JonRoderick
Hot Shot

Hmm, that's a big leap from a random problem to it being a known issue with a potential patch on the way. Disabling DRS isn't an option for me, and I'm looking to go into PROD with U2 the week after next. Am I doing the right thing, I wonder?

Jon.

azn2kew
Champion

Well, since these are training machines, can you quickly reload everything in 2 hours instead of going through this long troubleshooting procedure? All the possibilities have been tried without a fix. Why not test with fresh rebuilds?

1. Download the latest packages and builds for both VC 2.5 U2 and ESX 3.5 U2.

2. Back up your VirtualCenter database.

3. Create a new virtual machine, install the VC server components on it, and point it to the database location/credentials.

4. Migrate all VMs to the other ESX hosts, reinstall a fresh ESX 3.5 U2, and use Update Manager to patch it.

5. Configure all the TCP/IP networking and LUN pieces.

6. Migrate all the VMs back to the rebuilt ESX 3.5 U2 host and do the same for the remaining ESX hosts.

7. Create a new cluster and enable HA, DRS and VMotion and drag and drop the ESX hosts to it.

8. Check to see if you have any problems.

If you want to continue troubleshooting:

1. Make sure static DNS entries are in place.

2. Make sure your root partition is not full.

3. Check your database to make sure it has not filled up.

4. Restart the vpxa and hostd processes.

5. Completely remove the VPX agent and re-add your host.

6. Disconnect and remove all your ESX hosts and create a brand new cluster.

7. Add each ESX host to the new cluster one by one.

8. Verify you are using the latest patches for your ESX hosts.

9. Check all the logs for any specific issues and tackle it from there again.

10. Time to call VMware Support, since this is mind-boggling. I'm surprised that reloading ESX 3.5 U2 from scratch didn't work; there is definitely an issue on your end. Check your networking configuration.

If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!

Regards,

Stefan Nguyen

iGeek Systems Inc.

VMware, Citrix, Microsoft Consultant

Roxx16200
Contributor

AppK

I have to agree with Jon. Don't get me wrong, I GREATLY appreciate your help! But if this is a known issue, can you point to an article that states this, so I can track it closely and know when a fix is ready? I'm concerned I will come across this with any new installs for my clients.

Azn2kew

All ESX nodes were already rebuilt with no change. I'm currently rebuilding the VC and will post the results shortly.

admin
Immortal

I'm not aware of the full details of the issue or how it's triggered; it could be something specific to the config that's causing the issue.

If you delete the cluster and then recreate it, does the problem reoccur?

If not, if you remove and then reinstall VC, wiping the database, does it reoccur?

admin
Immortal

I think it's a good idea to do the full reinstall of VC and ESX with a fresh database to see if it reoccurs.

I don't think there is an article on this, or at least none that I have found.

The reason I think it's a known issue is that a colleague of mine has come across this same sort of behavior and is investigating it with VMware support.

Naturally, it's possible that I'm incorrect and that there is something specific about your cluster or hosts which is causing this problem.

mike_laspina
Champion

In hostd.log 1, the ESX host process is trying to send the updated VM resource pool information to the calling process, which was a web session outside of the managed VC process. It of course dies with a broken pipe message, since the web session does not have the VC behind it. So I am not sure this is a practical way to find the fault; however, this behavior is a bug that should be corrected. I don't think it is the cause, but more an interruption in the normal process flow.

What does strike me as significant is the ODBC call failure. This is important, as the ESX host needs to send the collected vmdb stats back to the VC DB at regular intervals.

I would have a closer look at the VC DB health and performance.

http://blog.laspina.ca/ vExpert 2009
Roxx16200
Contributor

Well....

I have completed the reinstallation and reconfiguration of VCenter and it seems to have done the trick. I can now shutdown/restart a virtual machine from the console AND from the web interface.

Time will tell if it holds true.

Thanks to everyone for the input on this problem. I'll update this thread if a problem resurfaces.
